Preface


This report constitutes the proceedings of the second Text REtrieval Conference (TREC-2) held in Gaithersburg,
Maryland, August 31-September 2, 1993. The conference was co-sponsored by the National Institute of
Standards  and  Technology  (NIST)  and  the  Advanced  Research  Projects  Agency  (ARPA),  and  was  attended  by 
150  people  involved  in  the  31  participating  groups.  The  conference  was  the  second  in  an  on-going  series  of 
workshops  to  evaluate  new  technologies  in  text  retrieval. 

The  workshop  included  plenary  sessions  and  twelve  discussion  groups.  Because  the  participants  in  the 
workshop  drew  on  their  personal  experiences,  they  sometimes  cited  specific  vendors  and  commercial  products. 
The  inclusion  or  omission  of  a  particular  company  or  product  does  not  imply  either  endorsement  or  criticism  by 
NIST. 

The  sponsorship  of  the  Software  and  Intelligent  Systems  Technology  Office  of  the  Advanced  Research  Pro- 
jects Agency  is  gratefully  acknowledged,  along  with  the  tremendous  work  of  the  program  committee. 

Donna Harman
February 20, 1994
Abstract


This  report  constitutes  the  proceedings  of  the  second  Text  REtrieval  Conference  (TREC-2)  held  in  Gaithers- 
burg,  Maryland,  August  31-September  2,  1993.  The  conference  was  co-sponsored  by  the  National  Institute  of 
Standards  and  Technology  (NIST)  and  the  Advanced  Research  Projects  Agency  (ARPA),  and  was  attended  by 
150  people  involved  in  the  31  participating  groups. 

The  goal  of  the  conference  was  to  bring  research  groups  together  to  discuss  their  work  on  a  new  large  test 
collection.  There  was  a  wide  variation  of  retrieval  techniques  reported  on,  including  methods  using  automatic 
thesauri, sophisticated term weighting, natural language techniques, relevance feedback, and advanced pattern
matching.  As  results  had  been  run  through  a  common  evaluation  package,  groups  were  able  to  compare  the 
effectiveness  of  different  techniques,  and  discuss  how  differences  between  the  systems  affected  performance. 

The  conference  included  paper  sessions  and  discussion  groups.  This  proceedings  includes  papers  from  most 
of  the  participants  (several  poster  groups  did  not  submit  papers),  along  with  reports  from  some  of  the  discussion 
groups. 
Overview of the Second Text REtrieval Conference (TREC-2)


Donna  Harman 

National  Institute  of  Standards  and  Technology 
Gaithersburg,  MD.  20899 


1.  Introduction 

In  November  of  1992  the  first  Text  REtrieval  Conference 
(TREC-1)  was  held  at  NIST  [Harman  1993].  The  confer- 
ence, co-sponsored  by  ARPA  and  NIST,  brought  together 
information  retrieval  researchers  to  discuss  their  system 
results  on  a  new  large  test  collection  (the  TIPSTER  col- 
lection). This  was  the  first  time  that  such  groups  had  ever 
compared  results  on  the  same  data  using  the  same  evalua- 
tion methods  and  represented  a  breakthrough  in  cross- 
system  evaluation  in  information  retrieval.  It  was  also  the 
first  time  that  most  of  these  groups  had  used  such  a  large 
test  collection  and  therefore  required  a  major  effort  by  all 
groups  to  scale  up  their  retrieval  techniques. 

The  overall  goal  of  the  TREC  initiative  is  to  encourage  re- 
search in  information  retrieval  using  large-scale  test  col- 
lections. It  is  hoped  that  by  providing  a  very  large  test 
collection,  and  encouraging  interaction  with  other  groups 
in a friendly evaluation forum, new momentum in information
retrieval will be generated. Because of the NIST
involvement,  groups  with  commercial  retrieval  products 
have  participated  in  TREC,  leading  to  increased  techno- 
logical transfer  between  the  research  labs  and  the  com- 
mercial products.  TREC  has  also  provided  a  state-of-the- 
art  showcase  of  retrieval  methods  for  ARPA  clients. 

Whereas  the  TREC-1  conference  demonstrated  a  wide 
range  of  different  approaches  to  the  retrieval  of  text  from 
large document collections, the results should be viewed
as  very  preliminary.  Not  only  were  the  deadlines  for  re- 
sults very  tight,  but  the  huge  increase  in  the  size  of  the 
document  collection  required  significant  system  rebuild- 
ing by  most  groups.  Much  of  this  work  was  a  system  en- 
gineering task:  finding  reasonable  data  structures  to  use, 
getting  indexing  routines  to  be  efficient  enough  to  index 
all  the  data,  finding  enough  storage  to  handle  the  large  in- 
verted files  and  other  structures,  etc.  Still,  the  results 
showed  that  the  systems  did  the  task  well,  and  that  auto- 
matic construction  of  queries  from  the  topics  did  as  well 
as,  or  better  than,  manual  construction  of  queries. 

The second TREC conference (TREC-2) occurred in August
of 1993, less than 10 months after the first conference.
In addition to 22 of the TREC-1 groups, nine new


groups took part, bringing the total number of participating
groups to 31. Many of the original TREC-1 groups
were  able  to  "complete"  their  system  rebuilding  and  tun- 
ing, and  in  general  the  TREC-2  results  show  significant 
improvements  over  the  TREC-1  results. 

This  paper  provides  an  overview  of  the  TREC-2  confer- 
ence, including  a  review  of  the  TREC  task,  a  brief  de- 
scription of  the  test  collection  being  used,  and  an 
overview  of  the  results.  The  papers  from  the  individual 
groups  should  be  referred  to  for  more  details  on  specific 
system  approaches. 

2.  The  TREC  Task 

2.1  Introduction 

TREC  is  designed  to  encourage  research  in  information 
retrieval using large data collections. Two types of
retrieval are being examined -- retrieval using an "adhoc"
query such as a researcher might use in a library
environment, and retrieval using a "routing" query such as a
profile to filter some incoming document stream. The TREC
task  is  not  tied  to  any  given  application,  and  is  not  primar- 
ily concerned  with  interfaces  or  optimized  response  time 
for  searching.  However  it  is  helpful  to  have  some  poten- 
tial user  in  mind  when  designing  or  testing  a  retrieval  sys- 
tem. The  model  for  a  user  in  TREC  is  a  dedicated 
searcher, not a novice searcher, and the model for the
application is one needing monitoring of data streams for
information on specific topics (routing), and the ability to
do  adhoc  searches  on  archived  data  for  new  topics.  It 
should  be  assumed  that  the  users  need  the  ability  to  do 
both  high  precision  and  high  recall  searches,  and  are  will- 
ing to  look  at  many  documents  and  repeatedly  modify 
queries  in  order  to  get  high  recall.  Obviously  they  would 
like  a  system  that  makes  this  as  easy  as  possible,  but  this 
ease  should  be  reflected  in  TREC  as  added  intelligence  in 
the  system  rather  than  as  special  interfaces. 

Since  TREC  has  been  designed  to  evaluate  system  perfor- 
mance both  in  a  routing  (filtering  or  profiling)  mode,  and 
in  an  adhoc  mode,  both  functions  need  to  be  tested. 
[Figure 1. The TREC Task. Surviving diagram labels: Documents (Disks 1 and 2); Test Documents (Disk 3).]

The test design was based on traditional information
retrieval models, and evaluation used traditional recall and
precision measures. The above diagram of the test design
shows the various components of TREC (fig. 1).

This  diagram  reflects  the  four  data  sets  (2  sets  of  topics 
and  2  sets  of  documents)  that  were  provided  to  partici- 
pants. These  data  sets  (along  with  a  set  of  sample  rele- 
vance judgments  for  the  100  training  topics)  were  used  to 
construct three sets of queries. Q1 is the set of queries
(probably  multiple  sets)  created  to  help  in  adjusting  a  sys- 
tem to  this  task,  to  create  better  weighting  algorithms,  and 
in  general  to  train  the  system  for  testing.  The  results  of 
this  research  were  used  to  create  Q2,  the  routing  queries 
to  be  used  against  the  test  documents.  Q3  is  the  set  of 
queries  created  from  the  test  topics  as  adhoc  queries  for 
searching  against  the  training  documents.  The  results 
from  searches  using  Q2  and  Q3  were  the  official  test 
results  sent  to  NIST. 


2.1  Specific  Task  Guidelines 

Because  the  TREC  participants  used  a  wide  variety  of 
indexing/knowledge  base  building  techniques,  and  a  wide 
variety  of  approaches  to  generate  search  queries,  it  was 
important  to  establish  clear  guidelines  for  the  evaluation 
task.  The  guidelines  deal  with  the  methods  of  index- 
ing/knowledge base  construction,  and  with  the  methods  of 
generating  the  queries  from  the  supplied  topics.  In  gen- 
eral, they  were  constructed  to  reflect  an  actual  operational 
environment, and to allow as fair as possible a separation
among  the  diverse  query  construction  approaches. 


There  were  guidelines  for  constructing  and  manipulating 
the  system  data  structures.  These  structures  were  defined 
to  consist  of  the  original  documents,  any  new  structures 
built automatically from the documents (such as inverted
files,  thesauri,  conceptual  networks,  etc.),  and  any  new 
structures  built  manually  from  the  documents  (such  as 
thesauri,  synonym  lists,  knowledge  bases,  rules,  etc.). 
The  following  guidelines  were  developed  for  the  TREC 
task. 

1.  System  data  structures  should  be  built  using  the 
initial  training  set  (documents  from  disks  1  and  2, 
training  topics  1-100,  and  the  relevance  judg- 
ments). They  may  be  modified  based  on  the  test 
documents  from  disk  3,  but  not  based  on  the  test 
topics. 

2.  There  are  parts  of  the  test  collection,  such  as  the 
Wall Street Journal and the Ziff material, that contain
manually assigned controlled or uncontrolled
index  terms.  These  fields  are  delimited  by  SGML 
tags,  as  specified  in  the  documentation  files 
included  with  the  data.  Since  the  primary  focus  is 
on  retrieval  and  routing  of  naturally  occurring  text, 
these  manually  indexed  terms  should  not  be  used. 

3.  Special  care  should  be  used  in  handling  the  rout- 
ing task.  In  a  true  routing  situation,  a  single  docu- 
ment would  be  indexed  and  compared  against  the 
routing topics. Since the test documents are generally
indexed as a complete set, routing should be
simulated  by  not  using  any  information  based  on 
the full set of test documents (such as weighting
based  on  the  test  collection,  total  frequency  based 
on  the  test  collection,  etc.)  in  the  searching.  It  is 
permissible  to  use  training-set  collection  informa- 
tion however. 
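
Guideline 3 can be illustrated with a small sketch. The code below is not taken from any TREC system; it assumes a simple tf-idf style score purely for illustration, and the function names `training_idf` and `route_score` are hypothetical. The point it shows is that term weights come from the training documents only, so each test document can be scored on its own, as in a true routing situation.

```python
import math
from collections import Counter


def training_idf(training_docs):
    """Inverse document frequency computed from the training documents only."""
    n_docs = len(training_docs)
    doc_freq = Counter()
    for text in training_docs:
        doc_freq.update(set(text.lower().split()))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def route_score(test_doc, query_terms, idf):
    """Score ONE incoming test document without any test-collection statistics."""
    counts = Counter(test_doc.lower().split())
    # Terms never seen in training get weight 0.0 (an assumption of this sketch).
    return sum(counts[term] * idf.get(term, 0.0) for term in query_terms)


# Toy training set standing in for disks 1 and 2.
training_docs = ["phone network upgrade", "phone prices", "computer standards"]
idf = training_idf(training_docs)

# A single incoming "disk 3" document, scored in isolation.
score = route_score("phone network network", ["phone", "network"], idf)
print(round(score, 4))
```

Because `route_score` never looks at other test documents, adding or removing documents from the test set cannot change the score of any one of them.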

Additionally  there  were  guidelines  for  constructing  the 
queries from the provided topics. These guidelines were
considered  of  great  importance  for  fair  system  compari- 
son and  were  therefore  carefully  constructed.  Three 
generic  categories  were  defined,  based  on  the  amount  and 
kind  of  manual  intervention  used. 

1.  AUTOMATIC  (completely  automatic  initial  query 
construction) 

adhoc  queries  -  The  system  will  automatically 
extract  information  from  the  topic  to  construct  the 
query.  The  query  will  then  be  submitted  to  the  sys- 
tem (with  no  manual  modifications)  and  the  results 
from  the  system  will  be  the  results  submitted  to 
NIST.  There  should  be  no  manual  intervention 
that  would  affect  the  results. 

routing queries -- The queries should be
constructed automatically using the training topics,
the training relevance judgments and the training
documents. The queries should then be submitted
to NIST before the test documents are
released and should not be modified after that
point. The unmodified queries should be run
against the test documents and the results submitted
to NIST.

2.  MANUAL  (manual  initial  query  construction) 

adhoc queries -- The query is constructed in some
manner from the topic, either manually or using
machine assistance. Once the query has been constructed,
it will be submitted to the system (with
no manual intervention), and the results from the
system will be the results submitted to NIST.
There should be no manual intervention after initial
query construction that would affect the
results. (Manual intervention is covered by the category
labelled FEEDBACK.)

routing queries -- The queries should be constructed
in the same manner as the adhoc queries
for MANUAL, but using the training topics, relevance
judgments, and training documents. They
should then be submitted to NIST before the test
documents are released and should not be modified
after that point. The unmodified queries
should be run against the test documents and the
results submitted to NIST.

3.  FEEDBACK  (automatic  or  manual  query  con- 
struction with  feedback) 

adhoc queries -- The initial query can be constructed
using either AUTOMATIC or MANUAL
methods. The query is submitted to the system,
and a subset of the retrieved documents is used for
manual feedback, i.e., a human makes judgments
about the relevance of the documents in this subset.
These judgments may be communicated to
the system, which may automatically modify the
query, or the human may simply choose to modify
the query himself. At some point, feedback
should end, and the query should be accepted as
final. Systems that submit runs using this method
must submit several different sets of results to
allow tracking of the time/cost benefit of doing
relevance feedback.

routing queries -- FEEDBACK cannot be used for
routing queries as routing systems have not supported
feedback.


2.2 The Participants

There  were  31  participating  systems  in  TREC-2,  using  a 
wide  range  of  retrieval  techniques.  The  participants  were 
able to choose from three levels of participation:
Category A, full participation; Category B, full participation
using a reduced dataset (1/4 of the full document set); and
Category C for evaluation only (to allow commercial systems
to protect proprietary algorithms). The program
committee  selected  only  20  category  A  and  B  groups  to 
present  talks  because  of  limited  conference  time,  and 
requested  that  the  rest  of  the  groups  present  posters.  All 
groups  were  asked  to  submit  papers  for  the  proceedings. 

Each group was provided the data and asked to turn in
either  one  or  two  sets  of  results  for  each  topic.  When  two 
sets of results were sent, they could be made using different
methods of creating queries (AUTOMATIC, MANUAL,
or FEEDBACK), or by using different parameter
settings  for  one  query  creation  method.  Groups  could 
choose to do the routing task, the adhoc task, or both, and
were requested to submit the top 1000 documents
retrieved  for  each  topic  for  evaluation. 

3.  The  Test  Collection 

3.1  Introduction 

The  creation  of  the  test  collection  (called  the  TIPSTER 
collection) was critical to the success of TREC. Like
most traditional retrieval collections, there are three distinct
parts to this collection -- the documents, the queries
or  topics,  and  the  relevance  judgments  or  "right  answers." 
These  test  collection  components  are  discussed  briefly  in 
the  rest  of  this  section.  For  a  more  complete  description 
of  the  collection,  see  [Harman  1994]. 

3.2 The Documents

The  documents  needed  to  mirror  the  different  types  of 
documents used in the theoretical TREC application.
Specifically  they  had  to  have  a  varied  length,  a  varied 
writing style, a varied level of editing and a varied vocabulary.
As a final requirement, the documents had to cover
different timeframes to show the effects of document date
on  the  routing  task. 

The documents were distributed as CD-ROMs with about
1 gigabyte of data each, compressed to fit. The following
shows the actual contents of each disk.

Disk 1

• WSJ -- Wall Street Journal (1987, 1988, 1989)
• AP -- AP Newswire (1989)
• ZIFF -- Articles from Computer Select disks (Ziff-Davis Publishing)
• FR -- Federal Register (1989)
• DOE -- Short abstracts from DOE publications

Disk 2

• WSJ -- Wall Street Journal (1990, 1991, 1992)
• AP -- AP Newswire (1988)
• ZIFF -- Articles from Computer Select disks (Ziff-Davis Publishing)
• FR -- Federal Register (1988)

Disk 3

• SJMN -- San Jose Mercury News (1991)
• AP -- AP Newswire (1990)
• ZIFF -- Articles from Computer Select disks (Ziff-Davis Publishing)
• PAT -- U.S. Patents (1993)

The  documents  are  uniformly  formatted  into  an  SGML- 
like  structure,  as  can  be  seen  in  the  following  example. 

<DOC>

<DOCNO> WSJ880406-0090 </DOCNO>
<HL> AT&T Unveils Services to Upgrade Phone Networks
Under Global Plan </HL>
<AUTHOR> Janet Guyon (WSJ Staff) </AUTHOR>
<DATELINE> NEW YORK </DATELINE>
<TEXT>

American Telephone & Telegraph Co. introduced the
first of a new generation of phone services with broad
implications for computer and communications equipment
markets.

AT&T said it is the first national long-distance carrier
to announce prices for specific services under a
world-wide standardization plan to upgrade phone networks.
By announcing commercial services under the
plan, which the industry calls the Integrated Services
Digital Network, AT&T will influence evolving communications
standards to its advantage, consultants said,
just as International Business Machines Corp. has created
de facto computer standards favoring its products.


</TEXT>
</DOC>

All  documents  have  beginning  and  end  markers,  and  a 
unique DOCNO id field. Additionally other fields taken


from  the  initial  data  appear,  but  these  vary  widely  across 
the  different  sources.  The  documents  have  differing 
amounts of errors, which were not checked or corrected.
Not  only  would  this  have  been  an  impossible  task,  but  the 
errors  in  the  data  provide  a  better  simulation  of  the  TREC 
task.  Errors  in  missing  document  separators  or  bad  docu- 
ment numbers  were  screened  out,  although  a  few  were 
missed  and  later  reported  as  errors. 
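
As an illustration of how this structure can be consumed, the following minimal sketch splits a file into documents and extracts the DOCNO and TEXT fields. It assumes well-formed closing tags of the form `</DOC>`, `</DOCNO>` and `</TEXT>`; the function name `parse_documents` is hypothetical and is not part of the TREC distribution. Records without a document number are skipped, mirroring the screening described above.

```python
import re

# Tag names follow the sample document shown earlier in this section.
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
DOCNO_RE = re.compile(r"<DOCNO>\s*(.*?)\s*</DOCNO>", re.DOTALL)
TEXT_RE = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL)


def parse_documents(raw):
    """Return (docno, text) pairs; documents lacking a DOCNO are skipped."""
    docs = []
    for match in DOC_RE.finditer(raw):
        body = match.group(1)
        docno = DOCNO_RE.search(body)
        if docno is None:
            continue  # malformed record: no usable identifier
        text = TEXT_RE.search(body)
        docs.append((docno.group(1), text.group(1).strip() if text else ""))
    return docs


sample = """<DOC>
<DOCNO> WSJ880406-0090 </DOCNO>
<HL> AT&T Unveils Services </HL>
<TEXT>
American Telephone & Telegraph Co. introduced new services.
</TEXT>
</DOC>
<DOC>
<HL> Record with no document number </HL>
</DOC>"""

parsed = parse_documents(sample)
print(parsed)
```

A regular-expression approach is used here rather than an SGML parser because the files are only SGML-like and contain uncorrected errors; a stricter parser might reject them.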

Table  1  shows  some  basic  document  collection  statistics. 
Note  that  although  the  collection  sizes  are  roughly  equiv- 
alent in  megabytes,  there  is  a  range  of  document  lengths 
from very short documents (DOE) to very long (FR).
Also  the  range  of  document  lengths  within  a  collection 
varies.  For  example,  the  documents  from  AP  are  similar 
in  length  (the  median  and  the  average  length  are  very 
close),  but  the  WSJ  and  ZIFF  documents  have  a  wider 
range  of  lengths.  The  documents  from  the  Federal  Regis- 
ter (FR)  have  a  very  wide  range  of  lengths. 

3.3  The  Topics 

In  designing  the  TREC  task,  there  was  a  conscious  deci- 
sion made  to  provide  "user  need"  statements  rather  than 
more  traditional  queries.  Two  major  issues  were  involved 
in  this  decision.  First  there  was  a  desire  to  allow  a  wide 
range  of  query  construction  methods  by  keeping  the  topic 
(the need statement) distinct from the query (the actual
text  submitted  to  the  system).  The  second  issue  was  the 
ability to increase the amount of information available
about  each  topic,  in  particular  to  include  with  each  topic  a 
clear  statement  of  what  criteria  make  a  document  relevant. 

The topics were designed to mimic a real user's need, and
were  written  by  people  who  are  actual  users  of  a  retrieval 
system.  Although  the  subject  domain  of  the  topics  was 
diverse,  some  consideration  was  given  to  the  documents 
to  be  searched.  The  topics  were  constructed  by  doing  trial 
retrievals  against  a  sample  of  the  document  set,  and  then 
those  topics  that  had  roughly  25  to  100  hits  in  that  sample 
were  used.  This  created  a  range  of  broader  and  narrower 
topics. 

The  following  is  one  of  the  topics  used  in  TREC. 
<top>

<head> Tipster Topic Description
<num> Number: 066
<dom> Domain: Science and Technology
<title> Topic: Natural Language Processing

<desc> Description:

Document will identify a type of natural language processing
technology which is being developed or marketed
in the U.S.

<smry> Summary:

Document will identify a type of natural language processing
technology which is being developed or marketed
in the U.S.

<narr> Narrative:

A relevant document will identify a company or institution
developing or marketing a natural language processing
technology, identify the technology, and identify
one or more features of the company's product.

<con> Concept(s):

1. natural language processing

2. translation, language, dictionary, font

3. software applications

<fac> Factor(s):
<nat> Nationality: U.S.
</fac>

<def> Definition(s):
</top>

Table 1. Document Statistics

Subset of collection      WSJ (disks 1 and 2)   AP           ZIFF         FR (disks 1 and 2)   DOE
                          SJMN (disk 3)                                   PAT (disk 3)

Size of collection (megabytes)
  (disk 1)                [illegible]           [illegible]  [illegible]  [illegible]          [illegible]
  (disk 2)                [illegible]           [illegible]  [illegible]  [illegible]
  (disk 3)                [illegible]           248          [illegible]  [illegible]

Number of records
  (disk 1)                [illegible]           [illegible]  [illegible]  [illegible]          [illegible]
  (disk 2)                74,520                79,923       56,920       20,108
  (disk 3)                90,257                78,325       161,021      6,711

Median number of terms per record
  (disk 1)                [illegible]           [illegible]  181          [illegible]          89
  (disk 2)                218                   346          167          315
  (disk 3)                279                   358          119          2896

Average number of terms per record
  (disk 1)                329                   375          412          1017                 89
  (disk 2)                377                   370          394          1073
  (disk 3)                337                   379          263          3543

[illegible] marks cells that could not be read in the source scan.

Each  topic  is  formatted  in  the  same  standard  method  to 
allow  easier  automatic  construction  of  queries.  Besides  a 
beginning  and  an  end  marker,  each  topic  has  a  number,  a 
short  title,  a  one-sentence  description,  and  a  summary 
sentence  or  two  that  can  be  used  as  a  surrogate  for  the  full 
topic  (often  very  similar  to  the  one-sentence  description). 
There  is  a  narrative  section  which  is  aimed  at  providing  a 
complete  description  of  document  relevance  for  the 


assessors.  Each  topic  also  has  a  concepts  section  with  a 
list  of  assorted  concepts  related  to  the  topic.  This  section 
is  designed  to  provide  a  mini-knowledge  base  about  a 
topic  such  as  a  real  searcher  might  possess.  Additionally 
each  topic  can  have  a  definitions  section  and/or  a  factors 
section.  The  definition  section  has  one  or  two  of  the  defi- 
nitions critical  to  a  human  understanding  of  the  topic. 
The  factors  section  is  included  to  allow  easier  automatic 
query building by listing specific items from the narrative
that constrain the documents that are relevant. Two particular
factors were used in the TREC-2 topics: a time
factor  (current,  before  a  given  date,  etc.)  and  a  nationality 
factor  (either  involving  only  certain  countries  or  excluding 
certain  countries). 
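
The standard topic layout lends itself to simple automatic processing. The sketch below is one hedged illustration, not the method of any participating group: it assumes each field opens with a tag such as `<num>` followed by a label ending in a colon, that field values contain no `<` character, and that a value runs until the next tag. The name `parse_topic` and the choice of fields are hypothetical.

```python
import re

FIELD_RE = re.compile(
    r"<(num|dom|title|desc|smry|narr|con|def)>"  # opening field tag
    r"[^:\n]*:"                                   # label, e.g. "Number:"
    r"(.*?)(?=<|\Z)",                             # value runs to the next tag
    re.DOTALL,
)


def parse_topic(raw):
    """Map each recognised field tag to its whitespace-normalised text."""
    fields = {}
    for tag, value in FIELD_RE.findall(raw):
        fields[tag] = " ".join(value.split())
    return fields


sample_topic = """<top>
<head> Tipster Topic Description
<num> Number: 066
<dom> Domain: Science and Technology
<title> Topic: Natural Language Processing
<desc> Description:
Document will identify a type of natural language processing
technology which is being developed or marketed in the U.S.
<def> Definition(s):
</top>"""

fields = parse_topic(sample_topic)
# A deliberately naive starting point for an automatic query: the title words.
title_terms = fields["title"].lower().split()
print(fields["num"], title_terms)
```

A real system would go on to weight, stem, or expand these terms; the sketch stops at field extraction because that is the step the topic format is designed to make easy.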

While the TREC topics did not present a problem in scaling,
the task of either automatically constructing a
query, or manually constructing a query with little foreknowledge
of its searching capability, was a major challenge
for TREC participants. In addition to filtering the
relatively  large  amount  of  information  provided  in  the 
topics  into  queries,  the  sometimes  narrow  definition  of 
relevance  as  stated  in  the  narrative  was  difficult  for  most 
systems  to  handle. 

3.4  The  Relevance  Judgments 

The  relevance  judgments  are  of  critical  importance  to  a 
test  collection.  For  each  topic  it  is  necessary  to  compile  a 
list  of  relevant  documents;  hopefully  as  comprehensive  a 
list  as  possible.   For  the  TREC  task,  three  possible 
methods for finding the relevant documents could have
been  used.  In  the  first  method,  full  relevance  judgments 
could  have  been  made  on  over  one  million  documents,  for 
each  topic,  resulting  in  over  100  million  judgments.  This 
was  clearly  impossible.  As  a  second  approach,  a  random 
sample  of  the  documents  could  have  been  taken,  with  rel- 
evance judgments  done  on  that  sample  only.  The  problem 
with this approach is that a random sample that is large
enough to find on the order of 200 relevant documents per
topic  is  a  very  large  random  sample,  and  is  likely  to  result 
in  insufficient  relevance  judgments.  The  third  method,  the 
one  used  in  TREC,  was  to  make  relevance  judgments  on 
the sample of documents selected by the various participating
systems. This method is known as the pooling
method, and has been used successfully in creating other
collections  [Sparck  Jones  &  van  Rijsbergen  1975].  The 
sample  was  constructed  by  taking  the  top  100  documents 
retrieved  by  each  system  for  a  given  topic  and  merging 
them  into  a  pool  for  relevance  assessment.  This  is  a  valid 
sampling  method  since  all  the  systems  used  ranked 
retrieval methods, with those documents most likely to be
relevant  returned  first. 
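
The pooling step described above can be sketched as follows. This is a minimal illustration for a single topic, assuming each run is simply a ranked list of document identifiers; the function name `build_pool` and the run names are hypothetical. A depth of 2 is used in the example only to keep it small, whereas TREC used the top 100.

```python
def build_pool(runs, depth=100):
    """Merge the top `depth` documents of every run into one set for assessment.

    `runs` maps a run identifier to that run's ranked document list for ONE topic.
    """
    pool = set()
    for ranked in runs.values():
        pool.update(ranked[:depth])
    return pool


runs = {
    "runA": ["d1", "d2", "d3"],
    "runB": ["d2", "d4", "d1"],
}
depth = 2
pool = build_pool(runs, depth=depth)
max_possible = len(runs) * depth  # pool size if no two runs shared a document

print(sorted(pool), len(pool), max_possible)
```

Documents ranked below the cutoff by every run (here `d3`) never reach the assessors, which is why the method depends on ranked output.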

Pooling proved to be an effective method. There was little
overlap among the 31 systems in their retrieved documents,
although considerably more overlap than in
TREC-1.

Table 2. Overlap of Submitted Results

                                      TREC-2              TREC-1
                                      Max     Actual      Max     Actual

Unique Documents Per Topic
(Adhoc, 40 runs, 23 groups)           4000    1106.0      3300    1278.86

Unique Documents Per Topic
(Routing, 40 runs, 24 groups)         4000    1465.6      2200    1066.86

Table 2 shows the overlap statistics. The first overlap
statistics are for the adhoc topics (test topics against training
documents disks 1 and 2), and the second statistics are
for the routing topics (training topics against test documents
disk 3 only). For example, out of a maximum of
4000 possible unique documents (40 runs times 100 documents),
over one-fourth of the documents were actually
unique. This means that the different systems were finding
different documents as likely relevant documents for a
topic. Whereas this might be expected (and indeed has
been shown to occur, Katzer et al. 1982) from widely


differing  systems,  these  overlaps  were  often  between  two 
runs  for  the  same  system.  One  reason  for  the  lack  of 
overlap  is  the  very  large  number  of  documents  that  con- 
tain many  of  the  same  terms  as  the  relevant  documents, 
but  the  major  reason  is  the  very  different  sets  of  terms  in 
the  constructed  queries.  This  lack  of  overlap  should 
improve the coverage of the relevance set, and verifies the
use of the pooling methodology to produce the sample.
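
The arithmetic behind these overlap figures can be checked directly from the numbers in Table 2. The short sketch below divides the actual average number of unique documents per topic by the maximum possible; the variable names are illustrative only.

```python
# (maximum possible, actual average) unique documents per topic, from Table 2.
table2 = {
    ("TREC-2", "adhoc"): (4000, 1106.0),
    ("TREC-2", "routing"): (4000, 1465.6),
    ("TREC-1", "adhoc"): (3300, 1278.86),
    ("TREC-1", "routing"): (2200, 1066.86),
}

unique_fraction = {
    key: actual / maximum for key, (maximum, actual) in table2.items()
}

for (conference, task), fraction in unique_fraction.items():
    print(f"{conference} {task}: {fraction:.1%} of submitted documents were unique")
```

For both tasks the unique fraction is lower in TREC-2 than in TREC-1, which is consistent with the statement that overlap increased between the two conferences.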

The merged list of results was then shown to the human
assessors.  Each  topic  was  judged  by  a  single  assessor  to 
insure  the  best  consistency  of  judgment.  Varying  numbers 
of  documents  were  judged  relevant  to  the  topics.  For  the 
TREC-2 adhoc topics (topics 101-150), the median number
of relevant documents per topic is 201, down from
277  for  topics  51-100  (as  used  for  adhoc  topics  in 
TREC-1).  Only  11  topics  have  more  than  300  relevant 
documents,  with  only  2  topics  having  more  than  500  rele- 
vant documents.  These  topics  were  deliberately  made 
narrower  than  topics  51-100  because  of  a  concern  that 
topics with more than 300 relevant documents are likely
to  have  incomplete  relevance  assessments. 

4.  Evaluation 

An important element of TREC was to provide a common
evaluation forum. Standard recall/precision and
recall/fallout figures were calculated for each TREC system
and these are presented in Appendix A. A chart with
additional data about each system is shown in Appendix
B. This chart consolidates information provided by the
systems that describe features and system timing, and
allows some primitive comparison of the amount of effort
needed to produce the results.

4.1  Definition  of  Recall/Precision  and  Recall/Fallout 
Curves 

Figure 2 shows typical recall/precision curves. The x axis
plots  the  recall  values  at  fixed  levels  of  recall,  where 


$$\text{Recall} = \frac{\text{number of relevant items retrieved}}{\text{total number of relevant items in collection}}$$


The  y  axis  plots  the  average  precision  values  at  those 
given  recall  values,  where  precision  is  calculated  by 


$$\text{Precision} = \frac{\text{number of relevant items retrieved}}{\text{total number of items retrieved}}$$


These curves represent averages over the 50 topics. The
averaging method was developed many years ago [Salton
& McGill 1983] and is well accepted by the information
retrieval community. The curves show system performance
across the full range of retrieval, i.e., at the early
stage of retrieval where the highly-ranked documents give
high accuracy or precision, and at the final stage of
retrieval  where  there  is  usually  a  low  accuracy,  but  more 
complete  retrieval.  Note  that  the  use  of  these  curves 
assumes  a  ranked  output  from  a  system.  Systems  that 
provide  an  unranked  set  of  documents  are  known  to  be 
less  effective  and  therefore  were  not  tested  in  the  TREC 
program. 
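
The following sketch shows how such a curve could be computed from a ranked list. It is an illustration under stated assumptions, not the evaluation package used for TREC: precision at a fixed recall level is taken as the highest precision observed at any recall at or above that level (a common interpolation convention, assumed here), and the per-topic curves are then averaged level by level. All function names are hypothetical.

```python
def recall_precision_points(ranked, relevant):
    """(recall, precision) measured each time a relevant document is retrieved."""
    points = []
    hits = 0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    return points


def interpolated_precision(points, levels):
    """Precision at each fixed recall level; 0.0 if that recall is never reached."""
    return [
        max((p for r, p in points if r >= level), default=0.0)
        for level in levels
    ]


def average_curves(curves):
    """Average several per-topic curves level by level."""
    return [sum(values) / len(values) for values in zip(*curves)]


levels = [0.0, 0.5, 1.0]
topic_a = interpolated_precision(
    recall_precision_points(["d1", "d2", "d3", "d4", "d5"], {"d1", "d3", "d6"}),
    levels,
)
topic_b = interpolated_precision(
    recall_precision_points(["d7", "d8"], {"d8"}), levels
)
averaged = average_curves([topic_a, topic_b])
print(topic_a, topic_b, averaged)
```

In topic A the relevant document `d6` is never retrieved, so precision at full recall is zero; this is how an incomplete ranking lowers the high-recall end of the averaged curve.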

The  curves  in  figure  2  show  that  system  A  has  a  much 
higher  precision  at  the  low  recall  end  of  the  graph  and 
therefore  is  more  accurate.  System  B  however  has  higher 
precision  at  the  high  recall  end  of  the  curve  and  therefore 
will  give  a  more  complete  set  of  relevant  documents, 
assuming  that  the  user  is  willing  to  look  further  in  the 
ranked  list. 

A second set of curves was calculated using the
recall/fallout measures, where recall is defined as before
and fallout is defined as

Fallout = (number of nonrelevant items retrieved) /
          (total number of nonrelevant items in collection)

Note that recall has the same definition as the probability
of detection and that fallout has the same definition as the
probability of false alarm, so that the recall/fallout curves
are also the ROC (Relative Operating Characteristic)
curves used in signal processing. A sample set of curves
corresponding to the recall/precision curves is shown in
figure 3. These curves show the same order of performance
as do the recall/precision curves and are provided
as an alternative method of viewing the results. The present
version of the curves is experimental, as the curve creation
is particularly sensitive to scaling (what range is
used for calculating fallout). The high precision section
of the curves does not show well in figure 3; the high
recall area dominates the curves.

Whereas the recall/precision curves show the retrieval
system results as they might be seen by a user (since
precision measures the accuracy of each retrieved document
as it is retrieved), the recall/fallout curves emphasize the
ability of these systems to screen out non-relevant
material. In particular the fallout measure shows the
discrimination powers of these systems on a large document
collection. For example, system A has a fallout of 0.02 at a
recall of about 0.48; this means that this system has
found almost 50% of the relevant documents, while only
retrieving 2% of the non-relevant documents.
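
As a worked illustration of the recall and fallout definitions, the following Python sketch reproduces a recall of 0.48 and a fallout of 0.02. The collection size, the counts, and the function name are invented for the example and do not come from the TREC data.

```python
def recall_and_fallout(retrieved_ids, relevant_ids, collection_size):
    """Return (recall, fallout) for one retrieved set of documents."""
    relevant = set(relevant_ids)
    retrieved = set(retrieved_ids)
    relevant_retrieved = len(retrieved & relevant)
    nonrelevant_retrieved = len(retrieved) - relevant_retrieved
    total_nonrelevant = collection_size - len(relevant)
    recall = relevant_retrieved / len(relevant)
    fallout = nonrelevant_retrieved / total_nonrelevant
    return recall, fallout


# Hypothetical collection: 1025 documents, 25 of them relevant.
relevant = {f"rel{i}" for i in range(25)}
# The system retrieves 12 relevant and 20 nonrelevant documents.
retrieved = [f"rel{i}" for i in range(12)] + [f"non{i}" for i in range(20)]
recall, fallout = recall_and_fallout(retrieved, relevant, collection_size=1025)
# recall is 12/25 and fallout is 20/1000
```

Because the nonrelevant part of a large collection is so big, even a small fallout value can correspond to many retrieved nonrelevant documents, which is one reason the scaling of the fallout axis matters.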

4.2  Single-Value Evaluation Measures

In addition to recall/precision and recall/fallout curves,
there were 2 single-value measures used in TREC-2.

The first measure, the non-interpolated average precision,
corresponds to the area under an ideal (non-interpolated)
recall/precision curve. To compute this average, a
precision average for each topic is first calculated. This is
done by computing the precision after every retrieved
relevant document and then averaging these precisions over
the total number of retrieved relevant documents for that
topic. These topic averages are then combined (averaged)
across all topics in the appropriate set to create the
non-interpolated average precision for that set.
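
The computation just described can be sketched in Python as follows. This is a minimal illustration that follows the wording above (averaging over the retrieved relevant documents for each topic); it is not the NIST evaluation program, and the function names and toy data are hypothetical. Some later evaluation tools divide by the total number of relevant documents for the topic instead, which gives lower values when relevant documents are missed.

```python
def topic_average_precision(ranked_ids, relevant_ids):
    """Average of the precision values observed at each retrieved
    relevant document for a single topic."""
    relevant = set(relevant_ids)
    hits = 0
    precisions = []
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precisions.append(hits / rank)
    if not precisions:
        return 0.0  # no relevant document was retrieved
    return sum(precisions) / len(precisions)


def non_interpolated_average_precision(runs, judgments):
    """Average the per-topic values across all topics in `runs`.

    `runs` maps topic -> ranked list of document ids;
    `judgments` maps topic -> collection of relevant document ids.
    """
    values = [topic_average_precision(ranking, judgments.get(topic, ()))
              for topic, ranking in runs.items()]
    return sum(values) / len(values)


runs = {"t1": ["d1", "d2", "d3", "d4"], "t2": ["a", "b"]}
judgments = {"t1": {"d1", "d3"}, "t2": {"b"}}
score = non_interpolated_average_precision(runs, judgments)
```

In the toy data, topic t1 yields precisions 1 and 2/3 (average 5/6) and topic t2 yields 1/2, so the combined value is 2/3.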

The second measure used is an average of the precision
for each topic after 100 documents have been retrieved for
that topic. This measure is useful because it reflects a
clearly comprehended retrieval point. It took on added
importance in the TREC environment because only the
top 100 documents retrieved for each topic were actually
assessed. For this reason it produces a guaranteed
evaluation point for each system.
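
A short Python sketch of the second measure follows. The function name and the toy ranking are hypothetical, and dividing by the cutoff even when fewer documents were retrieved is an assumption of this sketch.

```python
def precision_at(ranked_ids, relevant_ids, cutoff=100):
    """Fraction of the top `cutoff` retrieved documents that are relevant."""
    relevant = set(relevant_ids)
    top = ranked_ids[:cutoff]
    return sum(1 for doc_id in top if doc_id in relevant) / cutoff


# Toy ranking of 150 documents in which every fifth document is relevant.
ranking = [f"d{i}" for i in range(150)]
relevant = {f"d{i}" for i in range(0, 150, 5)}
p100 = precision_at(ranking, relevant)  # 20 relevant in the top 100
```

Averaging this value over all topics gives the single number reported for a run.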

4.3  Problems  with  Evaluation 

Since  this  was  the  first  time  that  such  a  large  collection  of 
text  has  been  used  in  open  system  evaluation,  there  were 
some  problems  with  the  existing  methods  of  evaluation. 
The  major  problem  concerned  a  thresholding  effect 
caused  by  the  inability  to  evaluate  ALL  documents 
retrieved  by  a  given  system. 

For TREC-1 the groups were asked to send in only the top
200 documents retrieved by their systems. This artificial
document cutoff is relatively low and systems did not
retrieve all the relevant documents for most topics within
the cutoff. All documents retrieved beyond the 200 mark
were considered nonrelevant by default and therefore the
recall/precision curves became inaccurate after about 40%
recall on average. TREC-2 used the top 1000 documents
for evaluation. Figure 4 shows the difference in the
curves produced by various evaluation thresholds,
including a curve for no threshold (similar to the way
evaluation has been done on the smaller collections). These
curves show that the use of a 1000-document cutoff has solved
most of the thresholding problem.
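
The thresholding effect can be made concrete with a small Python sketch. The ranks below are invented for illustration and are not taken from any TREC run; the function name is hypothetical.

```python
def recall_under_cutoff(relevant_ranks, total_relevant, cutoff=None):
    """Highest recall a run can show when only its top `cutoff`
    documents are evaluated (None means no cutoff)."""
    if cutoff is None:
        found = len(relevant_ranks)
    else:
        found = sum(1 for rank in relevant_ranks if rank <= cutoff)
    return found / total_relevant


# Hypothetical topic with 10 relevant documents, all eventually retrieved
# by the system at the ranks listed here.
ranks = [5, 40, 150, 180, 300, 650, 900, 990, 1500, 4000]
at_200 = recall_under_cutoff(ranks, total_relevant=10, cutoff=200)
at_1000 = recall_under_cutoff(ranks, total_relevant=10, cutoff=1000)
no_cutoff = recall_under_cutoff(ranks, total_relevant=10)
```

In this made-up case the 200-document cutoff caps recall at 0.4 and the 1000-document cutoff at 0.8, so any point of the recall/precision curve beyond the cap cannot reflect what the system actually retrieved.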

Two more issues in evaluation have become important.
The first issue involves the need for more statistical
evaluation. As will be seen in the results, the recall/precision
curves are often close, and there is a need to check whether
there are truly any statistically significant differences between
two systems' results or two sets of results from the same
system. This problem is currently under investigation in
collaboration with statistical groups experienced in the
evaluation of information retrieval systems.
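
One simple way to pose such a check is a paired sign test over per-topic scores, sketched below in Python. This is only an illustration of the kind of test that could be applied; it is not the analysis referred to above, and the function name and the scores are hypothetical.

```python
from math import comb


def sign_test_p_value(scores_a, scores_b):
    """Two-sided paired sign test on per-topic scores of two runs.

    Topics on which the two runs tie are ignored.
    """
    wins_a = sum(1 for a, b in zip(scores_a, scores_b) if a > b)
    wins_b = sum(1 for a, b in zip(scores_a, scores_b) if b > a)
    n = wins_a + wins_b
    if n == 0:
        return 1.0
    k = min(wins_a, wins_b)
    # Probability of k or fewer wins out of n under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical per-topic average precision for two runs on 10 topics;
# run A is better on 9 topics and worse on 1.
run_a = [0.42, 0.31, 0.55, 0.20, 0.61, 0.38, 0.47, 0.29, 0.50, 0.10]
run_b = [0.40, 0.28, 0.50, 0.18, 0.58, 0.35, 0.41, 0.25, 0.45, 0.30]
p_value = sign_test_p_value(run_a, run_b)  # 22/1024
```

The sign test uses only the direction of each per-topic difference, so it makes few assumptions but also discards the size of the differences.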

Another issue involves getting beyond the averages to better
understand system performance. Because of the huge
number of documents and the long topics, it is very
difficult to perform failure analysis on the results to better
understand the retrieval processes being tested. Without
better understanding of underlying system performance, it
will be hard to consolidate research progress. Some
preliminary analysis of per topic performance is provided in
section 6, and more attention will be given to this
problem in the future.

Figure 2.  A Sample Recall/Precision Curve.
Figure 3.  A Sample Recall/Fallout Curve.
Figure 4.  Effect of evaluation cutoffs on recall/precision curves.

5.  Results 

5.1  Introduction 

In general the TREC-2 results showed significant
improvements over the TREC-1 results. Many of the
original TREC-1 groups were able to "complete" their
system rebuilding and tuning tasks. The results for
TREC-2 therefore can be viewed as the "best first-pass"
that most groups can accomplish on this large amount of
data. The adhoc results in particular represent baseline
results from the scaling-up of current algorithms to large
test collections. The better systems produced similar
results, results that are comparable to those seen using
these algorithms on smaller test collections.

The routing results showed even more improvement over
TREC-1 routing results. Some of this improvement was
due to the availability of large numbers of accurate
relevance judgments for training (unlike TREC-1), but
most of the improvements came from new research by
participating groups into better ways of using the training
data.

For full descriptions of each system discussed in this
section, see the individual papers in these proceedings.

5.2  Adhoc  Results 

The  adhoc  evaluation  used  new  topics  (101-150)  against 
the  two  disks  of  training  documents  (disks  1  and  2). 
There  were  44  sets  of  results  for  adhoc  evaluation  in 
TREC-2,  with  32  of  them  based  on  runs  for  the  full  data 
set.  Of  these,  23  used  automatic  construction  of  queries, 
9  used  manual  construction,  and  2  used  feedback. 

Figure 5 shows the recall/precision curves for the six
TREC-2 groups with the highest non-interpolated average
precision using automatic construction of queries. The
results marked "INQ001" are the INQUERY system from
the University of Massachusetts (see Croft, Callan &
Broglio paper). This system uses probabilistic term
weighting and a probabilistic inference net to combine
various topic and document features. The results marked
"dortQ2", "Brkly3" and "crnlL2" are all based on the use
of the Cornell SMART system, but with important
variations. The "crnlL2" run is the basic SMART system from
Cornell University (see Buckley, Allan & Salton paper),
but using less than optimal term weightings (by mistake).
The "dortQ2" results from the University of Dortmund
come from using polynomial regression on the training
data to find weights for various pre-set term features (see
Fuhr, Pfeifer, Bremkamp, Pollmann & Buckley paper).
The "Brkly3" results from the University of California at
Berkeley come from performing logistic regression
analysis to learn optimal weighting for various term frequency
measures (see Cooper, Chen & Gey paper). The
"CLARTA" system from the CLARIT Corporation
expands each topic with noun phrases found in a
thesaurus that is automatically generated for each topic (see
Evans & Lefferts paper). The "lsiasm" results are from
Bellcore (see Dumais paper). This group uses latent
semantic indexing to create much larger vectors than the
more traditional vector-space models such as SMART.
The run marked "lsiasm" represents only the base
SMART pre-processing results, however. Due to
processing errors the "improved" LSI run produced unexpectedly
poor results.

Figure 6 shows the recall/precision curves for the six
TREC-2 groups with the highest non-interpolated average
precision using manual construction of queries. It should
be noted that varying amounts of manual intervention
were used. The results marked "INQ002", "siems2", and
"CLARTM" are automatically generated queries with
manual modifications. The "INQ002" results reflect
various manual modifications made to the "INQ001" queries,
with those modifications guided by strict rules. The
"siems2" results from Siemens Corporate Research, Inc.
(see Voorhees paper) are based on the use of the Cornell
SMART system, but with the topics manually modified
(the "not" phrases removed). These results were meant to
be the base run for improvements using WordNet, but the
improvements did not materialize. The "CLARTM"
results represent manual weighting of the query terms, as
opposed to the automatic weighting of the terms that was
used in "CLARTA." The results marked "VTcms2",
"CnQst2", and "TOPIC2" are produced from queries
constructed completely manually. The "VTcms2" results are
from Virginia Tech (see Fox & Shaw paper) and show the
effects of combining the results from SMART
vector-space queries with the results from manually-constructed
soft Boolean P-Norm type queries. The "CnQst2" results,
from ConQuest Software (see Nelson paper), use a very
large general-purpose semantic net to aid in constructing
better queries from the topics, along with sophisticated
morphological analysis of the topics. The results marked
"TOPIC2" are from the TOPIC system by Verity Corp.
(see Lehman paper) and reflect the use of an expert
system working off specially-constructed knowledge bases to
improve performance.


Several comments can be made with respect to these
adhoc results. First, the better results (most of the
automatic results and the three top manual results) are very
similar and it is unlikely that there are any statistically
significant differences between them. There is clearly no
"best" method, and the fact that these systems have very
different approaches to retrieval, including different term
weighting schemes, different query construction methods, and
different similarity match methods, implies that there is much
more to be learned about effective retrieval techniques.
As will be seen in section 6, whereas the averages for the
systems may be similar, the systems do better on different
topics and retrieve different subsets of the relevant
documents.

A second point that should be made is that the automatic
query construction methods continue to perform as well
as the manual construction methods. Two groups (the
INQUERY system and the CLARIT system) did explicit
comparison of manually-modified queries vs those that
were not modified and concluded that manual
modification provided no benefits. The three sets of results based
on completely manually-generated queries had even
poorer performance than the manually-modified queries.
Note that this result is specific to the very rich TREC
topics; it is not clear that this will hold for the short topics
normally seen in other retrieval environments.

As a final point, it should be noted that these adhoc results
represent significant improvements over the results from
TREC-1. Figure 7 shows a comparison of results for a
typical system in TREC-1 and TREC-2. Some of this
improvement is due to improved evaluation, but the
difference between the curve marked "TREC-1" and the curve
marked "TREC-2 looking at top 200 only" shows
significant performance improvement. Whereas this
improvement could represent a difference in topics (the
TREC-1 curve is for topics 51-100 and the TREC-2
curves are for topics 101-150), the TREC-2 topics are
generally felt to be more difficult and therefore this
improvement is likely to be an understatement of the
actual improvements.

Only two groups worked with less than the full document
collection. Figure 8 shows the results for the one group
with official TREC-2 category B results (the results from
UCLA were received after the deadline). This figure
shows the best results from New York University (see
Strzalkowski & Carballo paper), compared with a
category B version of the Cornell SMART results. The
"nyuir3" results reflect a very intensive use of natural
language processing (NLP) techniques, including a parse of
the documents to help locate syntactic phrases,
context-sensitive expansion of the queries, and other NLP
improvements on statistical techniques.

Figure 5.  Best Automatic Adhoc Results.
Figure 6.  Best Manual Adhoc Results.
Figure 7.  Typical Improvements in Adhoc Results.
Figure 8.  Category B Adhoc Results.

5.3  Routing Results

The routing evaluation used a subset of the training topics
(topics 51-100 were used) against the new disk of test
documents (disk 3). There were 40 sets of results for
routing evaluation, with 32 of them based on runs for the
full data set. Of the 32 systems using the full data set, 23
used automatic construction of queries, and 9 used
manual construction.

Figure 9 shows the recall/precision curves for the six
TREC-2 groups with the highest non-interpolated average
precision using automatic construction of the routing
queries. Again three systems are based on the Cornell
SMART system. The plot marked "crnlC1" is the actual
SMART system, using the basic Rocchio relevance
feedback algorithms, and adding many terms (up to 500) from
the relevant training documents to the terms in the topic.
The "dortP1" results come from using a
probabilistically-based relevance feedback instead of the vector-space
algorithm, and adding only 20 terms from the relevant
documents to each query. These two systems have the best
routing results. The "Brkly5" system uses logistic
regression on both the general frequency variables used in their
adhoc approach and on the query-specific relevance data
available for training with the routing topics. The results
marked "cityr2" are from City University, London (see
Robertson, Walker, Jones, Hancock-Beaulieu & Gatford
paper). This group automatically selected variable
numbers of terms (10-25) from the training documents for
each topic (the topics themselves were not used as term
sources), and then used traditional probabilistic
reweighting to weight these terms. The "INQ003" results also use
probabilistic reweighting, but use the topic terms,
expanded by 30 new terms per topic from the training
documents. The results marked "lsir2" are more latent
semantic indexing results from Bellcore. This run was
made by creating a filter of the singular-value
decomposition vector sum or centroid of all relevant documents for a
topic (and ignoring the topic itself).
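
The Rocchio relevance feedback mentioned above can be sketched in Python as follows. This is a generic textbook form of the Rocchio update, not the SMART implementation: the function name, the weights alpha, beta and gamma, the toy term vectors, and the rule for selecting new terms are all assumptions made for illustration.

```python
from collections import Counter


def rocchio_expand(query, relevant_docs, nonrelevant_docs,
                   alpha=1.0, beta=0.75, gamma=0.15, max_new_terms=500):
    """Reweight a query and add terms from relevant training documents.

    `query` and each document are dicts mapping term -> weight.
    """
    weights = Counter()
    for term, w in query.items():
        weights[term] += alpha * w
    for doc in relevant_docs:
        for term, w in doc.items():
            weights[term] += beta * w / len(relevant_docs)
    for doc in nonrelevant_docs:
        for term, w in doc.items():
            weights[term] -= gamma * w / len(nonrelevant_docs)

    # Keep the original query terms that remain positive.
    expanded = {t: w for t, w in weights.items() if t in query and w > 0}
    # Add the highest-weighted new terms, up to the limit.
    new_terms = sorted(
        ((t, w) for t, w in weights.items() if t not in query and w > 0),
        key=lambda item: (-item[1], item[0]),
    )
    expanded.update(new_terms[:max_new_terms])
    return expanded


query = {"oil": 1.0}
relevant = [{"oil": 1.0, "spill": 2.0}, {"spill": 1.0, "tanker": 1.0}]
nonrelevant = [{"oil": 1.0, "price": 2.0}]
expanded = rocchio_expand(query, relevant, nonrelevant, max_new_terms=1)
```

With the limit set to one new term in this toy case, only "spill" is added to the query; raising `max_new_terms` corresponds to the kind of large-scale expansion described for the "crnlC1" run.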

Figure 10 shows the recall/precision curves for the six
TREC-2 groups with the highest non-interpolated average
precision using manual construction of the routing
queries. The results marked "INQ004" are from the
INQUERY system using an inferential combination of the
"INQ003" queries and manually modified queries created
from the topic. The "trw2" results represent an adaptation
of the TRW Fast Data Finder pattern matching system to
allow use of term weighting (see Mettler paper). The
queries were manually constructed and the term
weighting was learned from the training data. The "gecrd1"
results from GE Research and Development Center (see
Jacobs paper) also come from manually constructed
queries, but using a general-purpose lexicon and the
training data to suggest input to the Boolean pattern matcher.
The results marked "CLARTM" are similar to the
"CLARTM" adhoc results except that the training
documents were used as the source for thesaurus building, as
opposed to using the top set of retrieved documents. The
"rutcombx" results from Rutgers University (see Belkin,
Kantor, Cool & Quatrain paper) come from combining 5
sets of manually generated Boolean queries to optimize
performance for each topic. The results marked
"TOPIC2" are from the TOPIC system and reflect the use
of an expert system working off specially-constructed
knowledge bases to improve performance.

As was the case with the adhoc topics, the automatic
query construction methods continue to perform as well
as, or in this case, better than the manual construction
methods. A comparison of the two INQUERY runs
illustrates this point and shows that all six results with
manually generated queries perform worse than the six runs
with automatically-generated queries. The availability of
the training data allows an automatic tuning of the queries
that would be difficult to duplicate manually without
extensive analysis.

Unlike the adhoc results, there are two runs ("crnlC1" and
"dortP1") that are clearly better than the others, with a
significant difference between the "crnlC1" results and the
"dortP1" results and also significant differences between
these results and the rest of the automatically-generated
query results. In particular the Cornell group's ability to
effectively use many terms (up to 500) for query
expansion was one of the most interesting findings in TREC-2
and represents a departure from past results (see Buckley,
Allan, & Salton paper for more on this).

As a final point, it should be noted that the routing results
also represent significant improvements over the results
from TREC-1. Figure 11 shows a comparison of results
for a typical system in TREC-1 and TREC-2. Some of
this improvement is due to the improved evaluation
techniques, but the difference between the curve marked
"TREC-1" and the curve marked "TREC-2 looking at top
200 only" shows significant performance improvement.
There is even more improvement for the routing results
than for the adhoc results, due to better training data
(mostly non-existent for TREC-1) and to major efforts by
many groups in new routing algorithm experiments.

Only four groups worked with less than the full document
collection. Figure 12 shows the results for two of the
groups in category B compared with a category B version
of the Cornell SMART results. These curves show the
results of runs from New York University (that were done
in a manner similar to that used for the adhoc results) and
results from Dalhousie University.

Figure 9.  Best Automatic Routing Results.
Figure 10.  Best Manual Routing Results.
Figure 12.  Category B Routing Results.

6.  Some Preliminary Analysis

6.1  Introduction

The recall/precision curves shown in section 5 represent
the average performance of the various systems on the full
sets of topics. It is important to look beyond these
averages in order to learn more about how a given system is
performing and to discover some generalizable principles
of retrieval.

Individual systems are able to do this by performing
failure analysis (see Dumais paper in these proceedings for a
good example) and by running specific experiments to test
hypotheses on retrieval behavior within a given system.
However, additional information can be gained by doing
some cross-system comparison: information about
specific system behavior and information about generalized
information retrieval principles. One way to do this is to
examine system behavior with respect to test collection
characteristics. A second method is to compare system
behavior on a topic by topic basis.

6.2  The Effects of Test Collection Characteristics

One particular test collection characteristic is the length of
documents, both the average length of documents in a
collection, and the variation in document length across a
collection. Document length has a significant effect on system
performance. A term that appears 10 times in a "short"
document is likely to be more important to that document
than if the same term appeared 10 times in a "long"
document. Table 3 shows system performance across the
different document subcollections for each of the adhoc
topics, listing the total number of documents that were
retrieved by the system as well as the number of relevant
documents that were retrieved.
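
The point about term counts in short and long documents can be illustrated with the simplest possible length normalization, dividing a term count by the document length. This Python sketch is only an example of the idea; the function name and the document lengths are invented, and the participating systems used a variety of more elaborate normalization schemes.

```python
def relative_term_frequency(term_count, document_length):
    """Term count divided by document length (in words)."""
    if document_length <= 0:
        raise ValueError("document_length must be positive")
    return term_count / document_length


# The same raw count of 10 in a hypothetical 100-word document and in a
# hypothetical 2000-word document.
short_doc_weight = relative_term_frequency(10, 100)
long_doc_weight = relative_term_frequency(10, 2000)
```

Under this normalization the term carries twenty times more weight in the short document, which is why collections with widely varying document lengths are sensitive to the normalization a system chooses.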

Two particular points can be seen from table 3. First, the
better systems retrieve about 50% relevant documents
from all the subcollections except the Federal Register
(FR). For this subcollection the retrieval rates are in the
25% range because the varied length of these documents
makes retrieval difficult.

The second point concerning table 3 is that the retrieval
rate across the subcollections is highly varied among the
systems. For example the "Brkly3" results show that
many fewer Federal Register documents and more AP
were retrieved than for the INQUERY system, whereas
the "CLARTA" results show more DOE abstracts and
fewer Wall Street Journal being retrieved. These "biases"
towards particular subcollections reflect the methods used
by the systems, such as length normalization,
domain concentrations of terminology, and methods used
to "merge" results across subcollections (often implicit
merges during indexing).

A second test collection characteristic worth examining is
the varied broadness and varied difficulty of the topics.
An analysis was done [Harman 1994] to find the topics
for which the systems retrieved the lowest percentage of
the relevant documents on average. These topics are 61,
67, 76, 77, 81, 85, 90, 91, 93, and 98 for the routing topics
and 101, 114, 120, 121, 124, 131, 139, 140, 141, and 149
for the adhoc topics. Tables 4 and 5 show the top 8
system runs for the individual topics based on the average
precision (noninterpolated). These tables mix automatic,
manual, and feedback results for category A, and also
category B results, so they should be interpreted carefully.
However they do demonstrate that no consistent patterns
appear for the "hard" topics. The two best routing runs
("crnlC1" and "dortP1") only do well on about half of
these topics, and the adhoc results are even more varied.
Often systems that do not perform well on average are the
top performing system for a given topic. This verifies
that, as usual, the variation across the topics is greater
than the variation across systems.

6.3  Cross-System  Analysis 

Tables  4  and  5  not  only  show  the  wide  variation  in  system 
performance,  but  also  raise  several  questions  about  system 
performance  in  general. 

1.  Does  better  average  performance  for  a  system 
result  from  better  performance  on  most  topics  or 
from  comparable  performance  on  most  topics  and 
significantly  better  performance  on  other  topics? 

2.  If two systems perform similarly on a given topic,
does that mean that they have retrieved a large
proportion of the same relevant documents?

3.  Do systems that use "similar" approaches have a
high overlap in the particular relevant documents
they retrieve?

4.  And,  if  number  3  is  not  true,  what  are  the  issues 
that  affect  high  overlap  of  relevant  documents? 
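
Questions 2 and 3 turn on the overlap between the relevant documents retrieved by two systems. One possible way to quantify it is sketched below in Python; the overlap measure (shared relevant documents divided by the relevant documents found by either system), the function name, and the toy runs are assumptions for illustration and are not taken from the NIST analysis.

```python
def relevant_overlap(run_a, run_b, relevant_ids):
    """Share of the relevant documents found by either run that were
    found by both runs (0.0 if neither run found any)."""
    relevant = set(relevant_ids)
    found_a = set(run_a) & relevant
    found_b = set(run_b) & relevant
    found_by_either = found_a | found_b
    if not found_by_either:
        return 0.0
    return len(found_a & found_b) / len(found_by_either)


# Two hypothetical runs for one topic with five relevant documents.
run_a = ["d1", "d2", "d3", "d4"]
run_b = ["d3", "d4", "d5", "d6"]
relevant = {"d1", "d3", "d4", "d5", "d9"}
overlap = relevant_overlap(run_a, run_b, relevant)  # 2 shared of 4 found
```

In the toy case both runs retrieve three relevant documents each, yet only half of the relevant documents found are shared, which shows how two runs with equal scores on a topic can still differ in what they retrieve.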

Work is ongoing at NIST on these questions and other
related issues.

Table 3.  Number of Documents Retrieved/Relevant by Document Subcollection


Run Tag     AP            DOE         FR          WSJ           ZIFF

DiKiyj 

07/79 
y  1  ILL 

1  847 /81 1 

lOHZ/OJi 

140/191 
jty/ 1  z  1 

citril 

9 1  ^9  /Q7n 

474/170 

190/1/^ 
IZy/  lO 

1  86^;  /770 
loOJ/  /  /U 

180/1 77 
joU/i  /  / 

CIUIZ 

"^48/1  '^fi 

9'\0/67 
ZjU/O/ 

1  814/7';6 
10 It/  /  JO 

189/170 
joZ/ 1  /y 

cityau 

1 J  / o/  /yr 

o'*/4^; 

1 1^0/147 

1601/661 
lOUj/OOl 

167/108 

JO//1UO 

ciiymi 

1Q'?Q/199fi 

94/91 

ZH/Zi 

<i84/14/^ 

J  of/ 1  HO 

9096/8 1 9 

ZUZO/OiZ 

497/171 

tZ  If  I  1  D 

n  APT  A 

40"? /1 70 

419/0^ 

17QS/S1Q 

1'n6/1Q0 
J  jO/  lyu 

9 1  1  /I  ns7 

zu  I/XUO  / 

-188/708 

JOO/ZUO 

ii';/87 

1760/890 

1  /D7/OZU 

107/991 

J7  //ZZl 

9979/091 

9^4/1 19 

ZiD^I  1 IZ 

1 1 1  /74 
ji  1/  /t 

1761/680 

400/1 81 

9914/QXn 

101/04 

4S1/01 

1787/718 

1  /  O  //  /  JO 

'155/184 
J  J  J/  io*+ 

1014/Q44 

648/740 

1 J  J/ J  J 

1 774/771 

500/911 

JU7/Z1 J 

rm1V9 

9 1  #^4/1 OS^ 

687/716 

70/79 
/  y/zz 

1 600/689 

lUUU/OOZ 

470/1 04 
H  l\JI  I  yr 

9081  /I  n'>'^ 

30*^/160 

471/78 
*+/  J/  /o 

1 8 1 8/8 1 S 

lO  iO/O IJ 

191/166 

JZJ/  lOO 

357/171 

186/44 

1924/874 

328/171 

J'i^yJl  XIX 

prima  1 

85/51 

1364/110 

2251/752 

221/94 

firitna9 

1267/614 

199/68 

1210/125 

2124/773 

268/131 

^\fOf  XUU 

0'P/*rfl9 

22S0/852 

294/91 

319/68 

1952/743 

185/77 

X  \J*Jf  1  1 

XJJ.  iV^nUi. 

2140/1042 

409/145 

164/53 

1875/839 

412/181 

HNCadS 

2163/1286 

306/159 

171/67 

1974/1005 

386/237 

9031/1071 

206/1 07 

297/1 1 5 

2184/1023 

282/151 

Z<0^/  X  ^  X 

90R7/1 1 1 1 

901  /1 90 

976/1 1 1 

Z  /  O/ 1 1 1 

9141/1010 

Zi*Ti/lUlU 

005/177 
Lyji  L 1 1 

Isial 

9978/771 

^87/1 94 
JO  1  ll^n 

194/0 
iZt/U 

1448/176 

'561/61 
jOj/oi 

isiasm 

91/^8/1  n^9 
zioo/iujz 

711/711 

/  1  i/Zl  1 

70/17 

1 607 /600 

10U//07U 

/I /I /I /I  o-j 
'I'I'I/  lO  J 

nyuiri 

u/u 

0/0 

0/0 

u/u 

^000/1 160 
jUUU/i  jOU 

0/0 
U/U 

nyuirz 

u/u 

0/0 

u/u 

0/0 
u/u 

'^000/1  '547 
jUUU/i  Jt  / 

0/0 

U/U 

u/u 

0/0 
u/u 

0/0 
u/u 

'^OOO/I  ^547 

JUUU/l  J't  / 

0/0 
u/u 

pircs3 

9 1  no/1 09 1 

1S8/1  '?9 

JJOl  ijZ 

946/86 
zto/oo 

1 000/81*; 

lyyyiojj 

988/1 10 

ZOO/  1 JJ' 

9108/1014 

"^49/1 48 

2^4/8*5 

9012/861 

284/137 

^O'T/  Xj/ 

T\rCf*f\  1 

1000/1024 

315/83 

1 178/205 

XX/  U/ 

1377/980 

x^  1  1 1  y\j\j 

1031/277 

X  V/»J  X  /  ^  /  / 

T^rpoi-*  1 

1667/1024 

695/83 

381/205 

1350/980 

907/277 

1  UlL^Uiiiu  1 

1 090/368 

181/79 

iOl/  /z 

112/18 

1 1^/  xo 

961/1 1 2 

215/79 

1  UlilllCC' 

04*;/^  00 

ni  /46 

161/0 

061/789 
yuj/zoz 

C/*ri  on  1 

90'?8/001 

ZUJO/7V1 

574/180 
J  jt/ioy 

171/18 

1 778/706 

477/1 
t  /  /  /  xoo 

CI  Am  oi 
s>lcIIlSZ 

999^/1 147 

ZZZj/i  it  / 

61 1  /7 1 8 

uz/o 

16^^^/770 
lOJJ/  /  /u 

497/909 

2238/1 173 

654/208 

53/7 

1619/764 

436/194 

TMC8        2054/859      146/44      763/59      1472/526      565/183
TMC9        1923/802      77/29       975/63      1401/507      624/171
TOPIC2      2292/996      152/98      344/100     1762/889      384/229
UREKA2      385/215       0/0         4003/87     354/144       258/10
UREKA3      755/405       5/2         2654/67     1045/348      441/22
uicah       1612/628      234/104     797/137     1846/356      511/167
VTcms2      2110/1130     232/107     444/95      1859/894      355/169
totals      71354/4630    12073/669   21407/396   79396/3929    15504/1154

Table 4.  System Rankings (using Average Precision) on Individual Topics


Topic  Systems

51     nyuir2    nyuir1    gecrd1    TOPIC2    ADS2      cityr2    INQ004
52     INQ004    INQ003    Brkly4    pircs2    VTcms2    gecrd1    pircs1    trw1
53     gecrd1    nyuir2    trw2      nyuir1    CLARTM    CLARTA    dortP1    INQ003
54     siems1    crnlR1    schau1    Brkly4    INQ003    crnlC1    lsir1     CLARTM
55     dortP1    crnlR1    crnlC1    lsir2     dortV1    CLARTM    cityr1    CLARTA
56     trw2      dortP1    dortV1    INQ003    INQ004    HNCrt1    crnlR1    crnlC1
57     INQ003    lsir2     INQ004    trw2      crnlC1    TMC6      VTcms2    [illegible]
58     nyuir2    nyuir1    rutcombx  INQ003    lsir2     INQ004    gecrd1    Brkly5
59     trw2      Brkly5    gecrd1    lsir1     HNCrt1    VTcms2    HNCrt2    lsir2
60     dortP1    dortV1    rutcombx  crnlR1    INQ004    crnlC1
61     TOPIC2    rutcombx  Brkly4    idsra2    cityr2    lsir1     INQ004    Brkly5
62     crnlR1    crnlC1    dortP1    lsir1     CLARTA    Brkly4    CLARTM    Brkly5
63     dortV1    crnlC1    pircs2    crnlR1    pircs1    siems1    HNCrt1    dortP1
64     nyuir2    lsir2     INQ004    INQ003    Brkly5    crnlC1    crnlR1    cityr2
65     crnlC1    dortV1    dortP1    HNCrt1    crnlR1    trw2      HNCrt2    lsir1
66     pircs2    pircs1    dortP1    dortV1    crnlR1    crnlC1    siems1    INQ004
67     crnlR1    crnlC1    INQ004    nyuir2    dortP1    cityr2    lsir2     INQ003
68     Brkly5    crnlC1    cityr1    cityr2    trw2      INQ003    lsir2     CLARTA
69     erimr1    Brkly5    dortV1    cityr2    cityr1    erimr2    lsir1     Brkly4
70     TMC6      TMC7      rutcombx  VTcms2    HNCrt2    Brkly5    INQ004    cityr2
71     crnlR1    crnlC1    HNCrt2    siems1    CLARTM    HNCrt1    CLARTA    lsir2
72     crnlR1    crnlC1    dortP1    siems1    INQ003    Brkly5    cityr1
73     INQ003    crnlR1    cityr2    INQ004    crnlC1    trw2      dortP1    dortV1
74     crnlR1    rutcombx  crnlC1    CLARTA    Brkly5    dortP1    siems1    dortV1
75     crnlC1    ADS2      crnlR1    trw1      lsir2     dortP1    cityr2    nyuir2
76     trw2      cityr2    TOPIC2    TMC6      TMC7      crnlC1    crnlR1    INQ003
77     crnlR1    crnlC1    INQ003    CLARTM    dortV1    dortP1    INQ004    CLARTA
78     rutcombx  TOPIC2    INQ004    CLARTM    INQ003    dortV1    pircs2    CLARTA
79     cityr2    crnlR1    crnlC1    INQ004    dortP1    gecrd1    lsir2     INQ003
80     trw1      crnlC1    crnlR1    cityr1    Brkly5    INQ003    INQ004    cityr2
81     gecrd1    TMC7      TMC6      cityr2    trw2      VTcms2    HNCrt2    cityr1
82     CLARTM    CLARTA    trw2      Brkly5    pircs1    pircs2    dortV1    dortP1
83     TOPIC2    gecrd1    trw1      crnlC1    HNCrt1    crnlR1    cityr2    cityr1
84     dortP1    crnlC1    lsir2     gecrd1    crnlR1    dortV1    trw1      VTcms2
85     crnlR1    crnlC1    dortP1    Brkly5    trw2      nyuir2    dortV1    siems1
86     gecrd1    VTcms2    lsir2     lsir1     cityr1    crnlR1    cityr2    crnlC1
87     lsir2     gecrd1    cityr1    cityr2    HNCrt1    Brkly5    crnlC1    HNCrt2
88     crnlC1    cityr2    crnlR1    Brkly4    dortP1    lsir2     dortV1    Brkly5
89     trw2      nyuir1    TOPIC2    TMC6      HNCrt1    [illegible]  HNCrt2  gecrd1
90     gecrd1    trw1      crnlC1    crnlR1    schau1    VTcms2    Brkly5    dortP1
91     trw1      INQ004    schau1    Brkly5    trw2      TOPIC2    HNCrt2    HNCrt1
92     gecrd1    crnlR1    lsir2     crnlC1    CLARTM    CLARTA    nyuir1    INQ003
93     rutcombx  [illegible]  trw1   [illegible]  gecrd1
94     lsir2     crnlC1    cityr2    gecrd1    INQ004    CLARTM    trw2      cityr1
95     VTcms2    gecrd1    crnlC1    Brkly5    crnlR1    Brkly4    trw1      siems1
96     dortP1    TOPIC2    cityr1    dortV1    cityr2    lsir2     crnlC1    rutcombx
97     idsra2    HNCrt1    nyuir2    dortP1    HNCrt2    lsir2     crnlC1    TOPIC2
98     HNCrt1    HNCrt2    crnlC1    trw2      DalTx2    INQ004    crnlR1    dortP1
99     lsir2     crnlR1    dortP1    CLARTA    crnlC1    CLARTM    dortV1    cityr2
100    crnlC1    crnlR1    dortP1    lsir2     dortV1    CLARTA    CLARTM    lsir1
Table 5. System Rankings (using Average Precision) on Individual Topics


Topic  Systems

101    rutcomb1  VTcms2    crnlV2    INQ002    dortQ2    pircs3    Brkly3    CLARTM
102    crnlL2    crnlV2    VTcms2    siems3    dortL2    INQ002    siems2    CLARTM
103    siems3    siems2    schau1    citri1    crnlV2    lsiasm    HNCad2    HNCad1
104    dortQ2    CLARTM    CLARTA    pircs4    pircs3    dortL2    HNCad2    lsiasm
105    citri2    lsiasm    citri1    siems2    siems3    crnlV2    schau1    crnlL2
106    VTcms2    INQ002    INQ001    TOPIC2    pircs4    pircs3    CLARTM    dortL2
107    CnQst1    CnQst2    rutcomb1  TOPIC2    VTcms2    INQ002    rutfmed   CLARTM
108    citri1    dortQ2    siems3    VTcms2    siems2    HNCad2    schau1    dortL2
109    dortL2    crnlL2    dortQ2    CLARTA    CLARTM    pircs3    crnlV2    pircs4
110    INQ002    INQ001    Brkly3    dortQ2    nyuir3    nyuir2    cityau    siems2
111    CLARTA    CLARTM    INQ001    dortQ2    Brkly3    siems2    siems3    pircs4
112    INQ002    INQ001    VTcms2    nyuir2    nyuir3    HNCad1    HNCad2    CnQst2
113    VTcms2    crnlL2    dortL2    crnlV2    nyuir1    siems2    CLARTM    INQ002
114    INQ002    cityau    VTcms2    INQ001    siems3    siems2    lsia1     TOPIC2
115    nyuir2    nyuir3    nyuir1    siems2    dortL2    crnlV2    siems3    crnlL2
116    VTcms2    CLARTA    HNCad2    HNCad1    siems3    siems2    CLARTM    Brkly3
117    citri2    citri1    dortQ2    INQ001    TMC8      lsiasm    gecrd2    schau1
118    nyuir2    nyuir3    nyuir1    TOPIC2    citymf    dortQ2    CLARTA    INQ001
119    nyuir1    nyuir2    nyuir3    INQ002    INQ001    dortQ2    citymf    VTcms2
120    citymf    nyuir2    nyuir3    nyuir1    CnQst2    CnQst1    VTcms2    erima2
121    TOPIC2    CLARTM    VTcms2    Brkly3    nyuir1    prceo1    INQ002    rutfmed
122    siems2    siems3    INQ002    INQ001    dortQ2    Brkly3    CLARTM    crnlV2
123    nyuir1    nyuir2    nyuir3    CLARTA    INQ001    INQ002    CLARTM    pircs4
124    nyuir2    nyuir3    nyuir1    dortL2    dortQ2    INQ001    Brkly3    TMC9
125    crnlV2    Brkly3    crnlL2    CLARTM    siems3    CLARTA    pircs4    pircs3
126    siems3    crnlL2    siems2    Brkly3    crnlV2    INQ002    CLARTM    INQ001
127    cityau    Brkly3    CLARTA    HNCad2    INQ001    INQ002    siems2    siems3
128    VTcms2    CLARTA    siems3    siems2    CLARTM    TOPIC2    citri1    lsiasm
129    INQ001    INQ002    cityau    CLARTM    siems2    Brkly3    crnlL2    CLARTA
130    INQ002    INQ001    dortQ2    crnlL2    pircs4    CLARTM    dortL2    pircs3
131    TOPIC2    VTcms2    HNCad1    HNCad2    siems3    Brkly3    siems2    INQ002
132    dortL2    INQ001    INQ002    citri1    citri2    dortQ2    HNCad2    crnlL2
133    CnQst2    CnQst1    rutcomb1  pircs4    INQ002    pircs3    cityau    INQ001
134    crnlL2    dortL2    nyuir1    nyuir2    nyuir3    INQ002    INQ001    dortQ2
135    nyuir2    nyuir3    nyuir1    Brkly3    INQ001    INQ002    siems3    siems2
136    VTcms2    CnQst1    CnQst2    CLARTM    pircs4    CLARTA    dortQ2    TOPIC2
137    CLARTA    nyuir2    nyuir3    Brkly3    siems2    siems3    CLARTM    nyuir1
138    nyuir2    nyuir3    rutfmed   rutcomb1  nyuir1    schau1    gecrd2    citri1
139    nyuir2    nyuir3    nyuir1    VTcms2    dortL2    HNCad2    dortQ2    HNCad1
140    nyuir2    nyuir3    nyuir1    dortQ2    dortL2    INQ002    siems3    siems2
141    VTcms2    INQ002    CnQst2    INQ001    Brkly3    dortL2    dortQ2    CnQst1
142    dortQ2    siems2    crnlL2    VTcms2    siems3    CLARTM    crnlV2    Brkly3
143    INQ002    INQ001    siems2    siems3    crnlL2    crnlV2    nyuir2    nyuir3
144    VTcms2    Brkly3    citymf    crnlV2    siems3    lsiasm    siems2    HNCad2
145    crnlL2    crnlV2    dortL2    CLARTM    nyuir1    siems3    siems2    dortQ2
146    Brkly3    siems3    siems2    lsiasm    crnlV2    schau1    CLARTM    citri1
147    HNCad2    HNCad1    VTcms2    citri1    INQ002    INQ001    citymf    CLARTA
148    lsiasm    crnlL2    crnlV2    siems2    siems3    Brkly3    dortL2    dortQ2
149    nyuir1    CnQst2    TOPIC2    CnQst1    CLARTA    rutfmed   Brkly3    rutcomb1
150    crnlL2    dortQ2    CLARTM    siems3    INQ002    INQ001    crnlV2    siems2
6.4  Summary

The TREC-2 conference demonstrated a wide range of
different approaches to the retrieval of text from large
document collections. There was significant improvement
in retrieval performance over that seen in TREC-1, especially
in the routing task. The availability of large
amounts of training data for routing allowed extensive
experimentation in the best use of that data, and many different
approaches were tried in TREC-2. The automatic
construction of queries from the topics continued to do as
well as, or better than, manual construction of queries,
and this is encouraging for groups supporting the use of
simple natural language interfaces for retrieval systems.

How well is the TREC initiative meeting its goals? There
is certainly increased research using a much larger collection
than had previously been tested. This leads not only
to discovering interesting research problems, but also to
developing algorithms that are ripe for transfer into commercial
systems. The conference itself provided the
opportunity for this; there was open exchange between
the research groups in universities and the research groups
in commercial organizations, and this is a very critical part
of technology transfer.

There  will  be  a  third  TREC  conference  in  1994,  and  all 
the  systems  that  participated  in  TREC-2  will  be  back, 
along  with  additional  groups. 
Okapi  at  TREC-2


S  E  Robertson*       S  Walker*       S  Jones*       M  M  Hancock-Beaulieu*       M  Gatford* 


Advisers: E Michael Keen (University of Wales, Aberystwyth),
Karen Sparck Jones (Cambridge University), Peter
Willett (University of Sheffield)

1  Introduction 

This paper reports on City University's work on the
TREC-2 project from its commencement up to November
1993. It includes many results which were obtained
after the August 1993 deadline for submission of official
results.

For TREC-2, as for TREC-1, City University used
versions of the Okapi text retrieval system much as described
in [2] (see also [3, 4]). Okapi is a simple and
robust set-oriented system based on a generalised probabilistic
model with facilities for relevance feedback, but
also supporting a full range of deterministic Boolean and
quasi-Boolean operations.

For TREC-1 [1] the "standard" Robertson-Sparck
Jones weighting function was used for all runs (equation
1, see also [5]). City's performance was not outstandingly
good among comparable systems, and the
intention for TREC-2 was to develop and investigate a
number of alternative probabilistic term-weighting functions.
Other possibilities included varieties of query expansion,
database models enabling paragraph retrieval
and the use of phrases obtained by query parsing.

Unfortunately, a prolonged disk failure prevented realistic
test runs until almost the deadline for submission
of results. A full inversion of the disks 1 and 2 database
was only achieved a few hours before the final automatic
runs. None of the new weighting functions (Section
2) was properly evaluated until after the results
had been submitted to NIST; we have since discovered
that several of these models perform much better than
the weighting functions used for the official runs, and
most of the results reported herein are from these later
runs.

1.1    The  system 

The Okapi system comprises a search engine or basic
search system (BSS), a low level interface used mainly
for batch runs and a user interface for the manual search
experiments (Section 5), together with data conversion
and inversion utilities. The hardware consisted of
Sun SPARC machines with up to 40 MB of memory,
and, occasionally, about 8 GB of disk storage. Several
databases were used from time to time: full disks 1 and
2, AP (disk 1) and WSJ (disk 1), full disk 3. All inverted
indexes included complete within-document positional
information, enabling term frequency and term
proximity to be used. Typical index size overhead was
around 80% of the textfile size. Elapsed time for inversion
of disks 1 and 2 was about two days. Running
a single topic with evaluation took from about one
minute to ten minutes, depending strongly on the number
of query terms. All preliminary evaluation used the
"old" SMART evaluation program. Runs tabulated in
this paper used an early version of the new evaluation
program, for which we are grateful to Chris Buckley of
Cornell University.

* Centre for Interactive Systems Research, Department of
Information Science, City University, Northampton Square, London
EC1V 0HB, UK

2    Some  new  probabilistic  models

Statistical approaches to information retrieval have traditionally
(to over-simplify grossly) taken two forms:

(a) approaches based on formal models, where the
model specifies an exact formula;

(b) ad-hoc approaches, where formulae are tried because
they seem to be plausible.

Both categories have had some notable successes. A
more recent variant is the regression approach of Fuhr
and Cooper (see, for example, [6]), which incorporates
ad-hoc choice of independent variables and functions
of them with a formal model for assessing their value
in retrieval, selecting from among them and assigning
weights to them.

One problem with the formal model approach is that
it is often very difficult to take into account the wide
variety of variables that are thought or known to influence
retrieval. The difficulty arises either because there
is no known basis for a model containing such variables,
or because any such model may simply be too complex
to give a usable exact formula.

One problem with the ad-hoc approach is that there is
little guidance as to how to deal with specific variables:
one has to guess at a formula and try it out. This
problem is also apparent in the regression approach,
although "trying it out" has a somewhat different sense
here (the formula is tried in a regression model, rather
than in a retrieval test).

The  discussions  of  Sections  2.1  and  2.3  exemplify  an 
approach  which  may  offer  some  reconciliation  of  these 
ideas.  Essentially  it  is  to  take  a  formal  model  which 
provides  an  exact  but  intractable  formula,  and  use  it  to 
suggest  a  much  simpler  formula.  The  simpler  formula 
can  then  be  tried  in  an  ad-hoc  fashion,  or  used  in  turn  in 
a  regression  approach.  Although  we  have  not  yet  taken 
this  latter  step  of  using  regression,  we  believe  that  the 
present  suggestion  lends  itself  to  such  methods. 

2.1    The  basic  model 

The basic probabilistic model is the traditional relevance
weight model [5], under which each term is given a
weight as defined below, and the score (matching value)
for each document is the sum of the weights of the
matching terms:

    w = \log \frac{(r + 0.5)/(R - r + 0.5)}{(n - r + 0.5)/(N - n - R + r + 0.5)}    (1)

where

N is the number of indexed documents;
n the number of documents containing the term;
R the number of known relevant documents;
r the number of relevant documents containing the term.

This approximates to inverse collection frequency
(ICF) when there is no relevance information. It will
be referred to below (with or without relevance information)
as w^{(1)}.
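
Equation 1 can be sketched in a few lines of Python. This is only an illustration: the function name and argument order are assumptions made here, not part of the Okapi system, and natural logarithms are used.

```python
import math

def rsj_weight(N, n, R=0, r=0):
    """Relevance weight of equation 1 (natural logarithm).

    N: indexed documents, n: documents containing the term,
    R: known relevant documents, r: relevant documents containing the term.
    """
    numerator = (r + 0.5) / (R - r + 0.5)
    denominator = (n - r + 0.5) / (N - n - R + r + 0.5)
    return math.log(numerator / denominator)

# With no relevance information (R = r = 0) the weight reduces to
# log((N - n + 0.5) / (n + 0.5)), an inverse-collection-frequency form.
no_feedback = rsj_weight(1000, 50)
with_feedback = rsj_weight(1000, 50, R=20, r=15)
```

In this made-up example a term found in 15 of 20 known relevant documents receives a higher weight than the same term with no relevance information.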

2.2    The  2-Poisson  model  and  term  frequency

One example of these problems concerns within-document
term frequency (tf). This variable figures in
a number of ad-hoc formulae, and it seems clear that
it can contribute to better retrieval performance. However,
there is no obvious reason why any particular function
of tf should be used in retrieval. There is not much
in the way of formal models which include a tf component;
one which does is the 2-Poisson model [7, 8].

The 2-Poisson model postulates that the distribution
of within-document frequencies of a content-bearing
term is a mixture of two Poisson distributions: one set
of documents (the "elite" set for the particular term,
which may be interpreted to mean those documents
which can be said to be "about" the concept represented
by the term) will exhibit a Poisson distribution of a certain
mean, while the remainder may also contain the
term but much less frequently (a smaller Poisson mean).
Some earlier work in this area [8] attempted to use an
exact formula derived from the model, but had limited
success, probably partly because of the problem of estimating
the required quantities. The approach here is to
use the behaviour of the exact formula to suggest a very
much simpler function of tf which behaves in a similar
way.

The exact formula, for an additive weight in the style
of w^{(1)}, of a term t which occurs tf times, is

    w = \log \frac{(p' \lambda^{tf} e^{-\lambda} + (1 - p') \mu^{tf} e^{-\mu}) (q' e^{-\lambda} + (1 - q') e^{-\mu})}{(q' \lambda^{tf} e^{-\lambda} + (1 - q') \mu^{tf} e^{-\mu}) (p' e^{-\lambda} + (1 - p') e^{-\mu})}    (2)

where

\lambda is the Poisson mean for tf in the elite set for t;
\mu is the Poisson mean for tf in the non-elite set;
p' is the probability of a document being elite
for t given that it is relevant;
q' is the probability of a document being elite
given that it is non-relevant.

As a function of tf, this can be shown to behave as
follows: it is zero for tf = 0; it increases monotonically
with tf, but at an ever-decreasing rate; it approaches an
asymptotic maximum as tf gets large. The maximum
is approximately the binary independence weight that
would be assigned to an infallible indicator of eliteness.

A very simple formula which exhibits similar behaviour
is tf/(tf + constant). This has an asymptotic
limit of unity, so must be multiplied by an appropriate
binary independence weight. The regular binary independence
weight for the presence/absence of the term
may be used for this purpose. Thus the weight becomes

    w = \frac{tf}{(k_1 + tf)} w^{(1)}    (3)

where k_1 is an unknown constant.
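
The behaviour claimed for the exact formula, and the way the simple tf multiplier mimics it, can be checked numerically. The sketch below is illustrative only: the function names are invented here, and the parameter values (Poisson means 5.0 and 0.5, p' = 0.8, q' = 0.1) are arbitrary assumptions chosen so that elite documents are both more term-rich and more likely to be relevant.

```python
import math

def two_poisson_weight(tf, lam, mu, p, q):
    """Exact 2-Poisson weight of equation 2 for a term occurring tf times."""
    e_lam, e_mu = math.exp(-lam), math.exp(-mu)
    num = (p * lam**tf * e_lam + (1 - p) * mu**tf * e_mu) * (q * e_lam + (1 - q) * e_mu)
    den = (q * lam**tf * e_lam + (1 - q) * mu**tf * e_mu) * (p * e_lam + (1 - p) * e_mu)
    return math.log(num / den)

def tf_saturation(tf, k1):
    """The simple multiplier tf / (k1 + tf): zero at tf = 0, limit of 1."""
    return tf / (k1 + tf)

# Exact weights for tf = 0..7 under the assumed parameters.
exact = [two_poisson_weight(tf, 5.0, 0.5, 0.8, 0.1) for tf in range(8)]
# The simple multiplier for the same tf values, with k1 = 1.
simple = [tf_saturation(tf, 1.0) for tf in range(8)]
```

Both sequences start at zero and rise with diminishing increments; the exact weight stays below the binary independence weight log(p'(1 - q') / (q'(1 - p'))) for these parameters.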

Several points may be made concerning this argument.
It is not by any stretch of the imagination
a strong quantitative argument; one may have many
reservations about the 2-Poisson model itself, and the
transformations sketched above are hardly justifiable in
any formal way. However, it results in a modification of
the binary independence weight which is at least plausible,
and has just slightly more justification than plausibility
alone.

The constant k_1 in the formula is not in any way
determined by the argument. The effect of choice of
constant is to determine the strength of the relationship
between weight and tf: a large constant will make for a
relation close to proportionality (where tf is relatively
small); a small constant will mean that tf has relatively little
effect on the weight (at least when tf > 0, i.e. when the
term is present).

Our approach has been to try out various values of
k_1 (around 1 may be about right for the full disks 1 and
2 database). However, in the longer term we hope to
use regression methods to determine the constant. It
is not, unfortunately, in a form directly susceptible to
the methods of Fuhr or Cooper, but we hope to develop
suitable methods.

2.3    Document  length 

The 2-Poisson model in effect assumes that documents
(i.e. records) are all of equal length. Document length
is a variable which figures in a number of weighting formulae.

We may postulate at least two reasons why documents
might vary in length. Some documents may simply
cover more material than others; an extreme version
of this hypothesis would have a long document consisting
of a number of unrelated short documents concatenated
together (the "scope hypothesis"). An opposite
view would have long documents like short documents,
but longer: in other words, a long document covers a
similar scope to a short document, but simply uses more
words (the "verbosity hypothesis").

It  seems  likely  that  real  document  collections  contain 
a  mixture  of  these  effects;  individual  long  documents 
may  be  at  either  extreme  or  of  some  hybrid  type.  All 
the  discussion  below  assumes  the  verbosity  hypothesis; 
no  progress  has  yet  been  made  with  models  based  on 
the  scope  hypothesis. 

The simplest way to deal with this model is to take
the formula above, but normalise tf for document length
(dl). If we assume that the value of k_1 is appropriate
to documents of average length (avdl), then this model
can be expressed as

    w = \frac{tf}{\left( \frac{k_1 \times dl}{avdl} + tf \right)} w^{(1)}    (4)

A more detailed analysis of the effect on the Poisson
model of the verbosity hypothesis is given in Appendix
7.4. This shows that the appropriate matching value for
a document contains two components. The first component
is a conventional sum of term weights, each term
weight dependent on both tf and dl; the second is a correction
factor dependent on the document length and
the number of terms in the query (nq), though not on
which terms match. A similar argument to the above
for tf suggests the following simple formulation:

    correction factor = k_2 \times nq \frac{(avdl - dl)}{(avdl + dl)}    (5)

where k_2 is another unknown constant.

Again, k_2 is not specified by the model, and must
(at present, at least) be discovered by trial and error.
Values in the range 0.0-0.3 appear about right for the
TREC databases (if natural logarithms are used in the
term-weighting functions^1), with the lower values being
better for equation 4 termweights and the higher values
for equation 3.
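
As a rough sketch, equations 3 to 5 translate into the following Python functions. The names are invented for illustration and are not Okapi identifiers; the default k_1 = 1 follows the value suggested in the text, and k_2 is left to the caller.

```python
def tf_term_weight(tf, w1, k1=1.0):
    """Equation 3: saturating tf multiplier applied to the weight w1 = w^(1)."""
    return tf / (k1 + tf) * w1

def tf_dl_term_weight(tf, w1, dl, avdl, k1=1.0):
    """Equation 4: as equation 3, with k1 scaled by dl / avdl."""
    return tf / (k1 * dl / avdl + tf) * w1

def length_correction(nq, dl, avdl, k2):
    """Equation 5: added once per document, whichever terms match."""
    return k2 * nq * (avdl - dl) / (avdl + dl)
```

For a document of exactly average length, equation 4 reduces to equation 3 and the correction factor vanishes. A longer document has its term weights damped under equation 4 and receives a negative correction under equation 5; a shorter one is treated the opposite way.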

2.4    Query  term  frequency  and  query  length

A similar approach may be taken to within-query term
frequency (qtf). In this case we postulate an "elite" set of
queries for a given term: the occurrence of a term in the
query is taken as evidence for the eliteness of the query
for that term. This would suggest a similar multiplier
for the weight:

    w = \frac{qtf}{(k_3 + qtf)} w^{(1)}    (6)

In this case, experiments suggest a large value of k_3
to be effective; indeed the limiting case, which is equivalent
to

    w = qtf \times w^{(1)}    (7)

appears to be the most effective.

We may combine a formula such as 6 or 7 with a
document term frequency formula such as 3. In practice
this seems to be a useful device, although the theory
requires more work to validate it.
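
The sense in which equation 7 is the limiting case of equation 6 can be made explicit: for large k_3 the multiplier qtf/(k_3 + qtf) approaches qtf/k_3, and dividing every term weight by the same constant k_3 leaves the document ranking unchanged. A small numerical sketch follows; the function names and the sample values are assumptions made for illustration.

```python
def qtf_weight(qtf, w1, k3):
    """Equation 6: query term frequency multiplier applied to w1 = w^(1)."""
    return qtf / (k3 + qtf) * w1

def qtf_weight_limit(qtf, w1):
    """Equation 7: the large-k3 limit, up to the constant factor 1 / k3."""
    return qtf * w1

k3 = 1e9
# Rescaling equation 6 by k3 recovers equation 7 almost exactly.
rescaled = k3 * qtf_weight(3, 2.0, k3)
limit = qtf_weight_limit(3, 2.0)
```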

2.5  Adjacency 

The recent success of weighting schemes involving a
term-proximity component [9] has prompted consideration
of including some such component in the Okapi
weighting. Although this does not yet extend to a full
Keen-type weighting, a method allowing for adjacency
of some terms has been developed.

Weighting formulae such as w^{(1)} can in principle be
applied to any identifiable and searchable entity (such
as, for example, a Boolean search expression). An obvious
candidate for such a weight is any identifiable
phrase. However, the problem lies in identifying suitable
phrases. Generally such schemes have been applied
only to predetermined phrases (e.g. those given in a dictionary
and identified in the documents in the course of
indexing). Keen's methods would suggest constructing
phrases from all possible pairs (or perhaps larger sets)
of query terms at search time; however, for queries of
the sort of size found in TREC, that would probably
generate far too many phrases.

The approach here has been to take pairs of terms
which are adjacent in the query as candidate phrases.
The present Okapi allows adjacency searches, so a
phrase that is not specifically indexed can be searched,
and assigned a weight in the usual Okapi fashion as if
it had been indexed.

^1 To obtain weights within a range suitable for storage as 16-bit
integers, the Okapi system uses logarithms to base 2^{0.1}.

One problem with that approach is that the single
words that make up the phrase will probably also be
included in the query, and that suggests that a document
which contains the phrase will be overweighted,
as it will be given the weight assigned to the phrase
in addition to the individual term weights. So in the
present experiments the weight assigned to the phrase
has been adjusted downwards, by deducting the weights
of the constituent terms, to allow for the fact that the
individual term weights have necessarily been added.
Where this correction would give a negative weight to
the phrase, it has been adjusted again to an arbitrary
small positive number.
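
The adjustment just described amounts to a subtraction with a floor. In the sketch below the function name is invented, and the floor value 0.01 is a placeholder: the text says only that an arbitrary small positive number was used.

```python
def adjusted_phrase_weight(phrase_weight, term_weights, floor=0.01):
    """Deduct the constituent term weights from the phrase weight.

    A result that would be zero or negative is replaced by a small
    positive floor (0.01 here is an assumed placeholder value).
    """
    adjusted = phrase_weight - sum(term_weights)
    return adjusted if adjusted > 0 else floor
```

For example, a phrase weighted 10.0 whose two terms weigh 3.0 and 4.0 contributes an extra 3.0; a phrase weighted 5.0 with the same terms contributes only the floor value.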

2.6    Weighting  functions  used 

More than 20 combinations of the weighting functions
discussed above were implemented at one time or another.
Those mentioned in this paper are listed here.
For brevity, most of the functions are referred to as
BMnn (Best Match).

BM0: Flat, or quorum, weighting. Each term is given
the same weight.

BM1: w^{(1)} termweights.

BM15: 2-Poisson termweights as equation 3 with document
length correction as equation 5.

BM11: 2-Poisson termweights with document length
normalisation as equation 4^2.

3    Document  processing 

For TREC-1 City used an elaborate 25-field structure
which was intended to make all the disparate datasets
on the CDs fit a unified model. It would, for example,
have been possible to restrict searches to "title",
"headline" etc. In the event only the TEXT was used.
For TREC-2, fields which looked useful for searching
were simply concatenated into one long field. For most
datasets fields other than DOCNO and TEXT were
ignored, but the SJM LEAD PARAGRAPH, the Ziff
SUMMARY and a few additional fields from the Patents
records were included. This was done using a simple perl
script (in contrast to the TREC-1 conversion program
which used lex, yacc and C). Most of the known data errors
were handled satisfactorily, although for some reason
there still remained a few duplicate DOCNOs from
disk 1 and/or 2.

^2 In theory there was also an equation 5 document length correction,
but the best value of k_2 was found to be zero.


4    Automatic  query  processing 

4.1  Ad-hoc 

A  large  number  of  evaluation  runs  have  been  done  to 
investigate 

•  the  effect  of  query  term  source 

•  the  use  of  a  query  term  frequency  (qtf)  component 
in  term  weighting,  and 

•  the  use  of  algorithmically  derived  term  pairs. 

4.1.1    Derivation  of  queries  from  the  topics 

Topic processing was very simple. A program (written
in awk) was used to isolate the required topic
fields, which were then parsed and the resulting terms
stemmed in accordance with the indexing procedures of
the database to be searched. A small additional stop list
was applied to the NARRATIVE and DESCRIPTION
fields only. If required, the procedure also output pairs
of adjacent terms which occur in the same subfield of
the topic and with no intervening punctuation. For example
the command

get_qterms 70 trec12_93 ted pairs=1

applied to

<title> Topic: Surrogate Motherhood
<desc> Description:

Document will report judicial proceedings and
opinions on contracts for surrogate motherhood.

<con> Concept(s):

1. surrogate, mothers, motherhood

2. judge, lawyer, court, lawsuit, custody, hearing,
opinion, finding

(topic 70)

gave 


70:19:desc:1:contract:1
70:19:con:1:court:1
70:19:con:1:custodi:1
70:19:con:1:find:1
70:19:con:1:hear:1
70:19:con:1:judg:1
70:19:desc:1:judici:1
70:19:con:1:lawsuit:1
70:19:con:1:lawyer:1
70:19:con:1:mother:1
70:19:tit:1:motherhood:3
70:19:con:1:opinion:2
70:19:desc:1:proceed:1
70:19:tit:1:surrog:3
70:19:desc:2:contract:surrog:1
70:19:desc:2:judici:proceed:1
70:19:desc:2:opinion:contract:1
70:19:desc:2:proceed:opinion:1
70:19:tit:2:surrog:motherhood:2

where the fields are topic number, topic length (number
of terms counting repeats but not pairs), source field
(in precedence order TITLE > CONCEPTS > NARRATIVE
> DESCRIPTION > DEFINITIONS), number
of terms, term ..., frequency of this term or pair in
the topic.
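
A reader processing such output could split each colon-separated record into the fields just listed. The following Python sketch is an assumption about how one might do so, not the authors' code; the names QueryTerm and parse_record are invented here.

```python
from collections import namedtuple

# One record of the colon-separated output described above.
QueryTerm = namedtuple("QueryTerm", "topic length field terms freq")

def parse_record(line):
    """Split a record such as '70:19:tit:2:surrog:motherhood:2' into fields."""
    parts = line.strip().split(":")
    topic = int(parts[0])
    length = int(parts[1])
    field = parts[2]
    nterms = int(parts[3])          # 1 for a single term, 2 for a pair
    terms = tuple(parts[4:4 + nterms])
    freq = int(parts[4 + nterms])   # frequency of the term or pair in the topic
    return QueryTerm(topic, length, field, terms, freq)

single = parse_record("70:19:con:1:opinion:2")
pair = parse_record("70:19:tit:2:surrog:motherhood:2")
```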

4.1.2    Document  and  query  term  weighting

Table  1  shows  the  effect  of  varying  query  term  source 
fields  when  no  account  is  taken  of  within-query  term 
frequency. 

Some tentative conclusions can be drawn: adding TITLE
to CONCEPTS improves most measures slightly;
TITLE alone works well in a surprising proportion of
the topics; the DESCRIPTION field is fairly harmless
used in conjunction with CONCEPTS, but NARRATIVE
and DEFINITIONS are detrimental. (TIME and
NATIONALITY fields, which are occasionally present,
were never used.) This really only confirms what may
be evident to a human searcher: that CONCEPTS consists
of search terms, but most of the other fields apart
from TITLE are instructions and guidance to relevance
assessors. A sentence such as "To be relevant, a document
must identify the case, state the issues which are
or were being decided and report at least one ethical
or legal question which arises from the case." (from
the NARRATIVE field of topic 70) can only contribute
noise.

However, when a within-query term frequency (qtf)
component is used in the term weighting, the information
about the relative importance of terms gained
from the use of all or most of the topic fields seems to
outweigh the detrimental effect of noisy terms such as
"identify", "state", "issues", "question". Some results
are summarised in Table 2. A number of values of k_3
were tried in equation 6, and a large value proved best
overall, giving the limiting case (equation 7), in which
the term weight is simply multiplied by qtf.

Many combinations of the weighting functions discussed
in Section 2, as well as others not described
here, were first tested on the AP and/or WSJ databases.
Some of them were eliminated immediately. The function
defined as BM15 gave almost uniformly better results
than w^{(1)}, after suitable values for the constants
had been found. BM11 appeared slightly less good than
BM15 on the small databases, but later runs on the
large databases showed that, with suitable choice of
constants, it was substantially, though not uniformly,
better. This may be a consequence of the greater variation
in document lengths found in the large databases.
Table 3 compares the more elaborate term weighting
functions with the standard w^{(1)} weighting and with a
baseline coordination level run.

Some work was done on the addition of adjacent pairs
of topic terms to the queries (see Section 2.5). A number
of runs were done, using several different ways of adjusting
the "natural" weights of adjacent pairs. There
was little difference between them, and the results are
at best only slightly better than those from single terms
alone (Table 3). There was also little difference between
using all adjacent pairs and using only those pairs which
derive from the same sentence of the topic, with no intervening
punctuation.


4.2  Routing 

Potential query terms were obtained by "indexing" all the
known relevant documents from disks 1 and 2; the topics
themselves were not used (nor were known non-relevant
documents). These terms were then given w^(1) weights and
selection values [11] given by r/R × w^(1), where r and R
are as in equation 1.

A large number of retrospective test runs were performed
on the complete disks 1 and 2 database, in which the number
of terms selected and the weighting function were the
independent variables. Overall, there was little difference
in the average precision over the range 10-25 terms. This
is consistent with the results reported by Harman in [10].
With regard to weighting functions, BM1 was slightly better
than BM15. However, looking at individual queries, the
optimal number of terms varied between three (several
topics) and 31 (topic 89) with a median of 11; and BM15 was
better than BM1 for 27 of the topics.

Two sets of official queries and results were produced.
For the cityr1 run, the top 20 terms were selected for each
topic and the weighting function was BM1. For cityr2 the
test runs were sorted for each topic by precision at 30
documents within recall within average precision, and the
"best" combination of number of terms and weighting
function was chosen. When evaluated retrospectively against
the full disks 1 and 2 database the cityr2 queries were
about 17% better on average precision and 10% better on
recall than the cityr1 queries. The official results (first
and second rows of Table 4) show a similar difference.
Later, both sets of queries were repeated using BM11
instead of the previous weighting functions (third and
fourth rows of the table). These final runs both show
substantially better results than either of the official
runs.

Table 1: Effect of varying query term sources (no query term frequency component)

Query     Ave    Ave    Prec   Prec   Prec                    % of topics where
source    lgth   Prec   at 5   at 30  at 100  R-Prec  Recall  AveP > median
TC        30.3   0.300  0.624  0.536  0.440   0.349   0.683   66
C         26.7   0.296  0.636  0.524  0.436   0.346   0.686   58
TCD       39.7   0.297  0.592  0.519  0.429   0.340   0.667   62
TCND      81.0   0.263  0.612  0.485  0.394   0.306   0.605   48
TCN       71.6   0.262  0.624  0.481  0.397   0.309   0.604   50
TCNDDef   86.3   0.257  0.580  0.468  0.387   0.303   0.604   46
TN        44.9   0.181  0.500  0.418  0.320   0.245   0.491   26
TND       54.4   0.179  0.492  0.403  0.317   0.243   0.491   24
TD        13.1   0.170  0.428  0.381  0.297   0.244   0.492   28
T          3.6   0.165  0.380  0.343  0.271   0.233   0.471   32

Terms: single. Document term weights: BM11. Database: disks 1 and 2. Topics 101-150.
Query average length is the average number of terms taking account of repeats.

Table 2: Effect of varying query term sources (with query term frequency component)

Query     Weight                                                % of topics where
source    function  AveP   P5     P30    P100   RP     Rcl    AveP > median
TCND      BM11      0.360  0.652  0.569  0.479  0.401  0.754  92
TCN       BM11      0.356  0.644  0.565  0.482  0.399  0.749  92
TCNDDef   BM11      0.354  0.648  0.559  0.474  0.395  0.751  92
TCD       BM11      0.353  0.644  0.565  0.481  0.394  0.750  90
TC        BM11      0.335  0.636  0.560  0.468  0.375  0.723  86
TC        BM15      0.284  0.560  0.485  0.416  0.336  0.685  56
TND       BM11      0.283  0.556  0.503  0.414  0.338  0.652  60
TN        BM11      0.274  0.556  0.497  0.399  0.331  0.643  56
TC        BM1       0.232  0.504  0.435  0.361  0.289  0.601  28

Document term weights were multiplied by qtf, equivalent to large k3 in eqn 6.
Terms: single. Database: disks 1 and 2. Topics 101-150.


Table 3: Effect of different document term weighting functions: single terms and adjacent pairs

Weight                                                                       % of topics where
function  Terms                        AveP   P5     P30    P100   RP     Rcl    AveP > median
BM11      singles + "natural" pairs    0.307  0.628  0.541  0.448  0.358  0.696  62
BM11      singles + all adj pairs      0.304  0.612  0.544  0.447  0.357  0.694  62
BM11      singles                      0.300  0.624  0.536  0.440  0.349  0.683  66
BM15      singles                      0.227  0.500  0.434  0.351  0.285  0.595  38
BM1       singles                      0.199  0.468  0.416  0.326  0.261  0.542  22
BM0       singles                      0.142  0.412  0.336  0.270  0.209  0.411  12

"Natural" means adjacent in the same sentence of the topic with no intervening punctuation.
Query term source: TC. qtf component: none. Database: disks 1 and 2. Topics: 101-150.

Table 4: Some routing results

Weight     Number                                                 % of topics where
function   of terms  AveP   P5     P30    P100   RP     Rcl    AveP > median
BM1/BM15   variable  0.356  0.692  0.561  0.449  0.388  0.680  78
BM1        top 20    0.315  0.628  0.533  0.432  0.361  0.648  70
BM11       variable  0.394  0.700  0.599  0.481  0.429  0.713  92
BM11       top 20    0.362  0.684  0.605  0.459  0.397  0.707  80

Best predictive run for comparison (BM11, qtf with large k3, source TCD):
                     0.300  0.612  0.524  0.394  0.345  0.632  68

Database: disk 3. Topics: 51-100

5  Manual  queries  with  feedback 
5.1    The  user  interface 

The  interface  allowed  the  entry  of  any  number  of 
find  commands  operating  on  "natural  language"  search 
terms.  By  default,  the  system  would  combine  the  result- 
ing sets  using  the  BM15  function  described  in  Section 
2.6,  but  any  operation  specified  by  the  searcher  would 
override  this.  All  user-entered  terms  were  added  to  a 
pool  of  terms  for  potential  use  in  query  expansion.  Ev- 
ery set  produced  had  any  documents  previously  seen  by 
the  user  removed  from  it. 

The  show  (document  display)  command  displayed 
the  full  text  of  a  single  document  (or  as  much  as  the 
user  wished  to  see)  with  the  retrieval  terms  highlighted 
(sometimes  inaccurately).  Unless  specified  by  the  user 
this  would  be  the  highest-weighted  remaining  document 
from  the  most  recent  set.  At  the  end  of  a  document  dis- 
play the  relevance  question 

"Is  this  relevant  (y/n/?)" 

appeared; the system counted documents eliciting the "?"
response as relevant (see footnote). The DOCNO was then
output to a results file, together with the iteration
number.

Once some documents had been judged relevant the extract
command would produce a list of terms drawn from the pool
consisting of user-entered terms and terms extracted from
all relevant documents. Terms in the pool were given w^(1)
weights. User-entered terms were weighted as if they had
occurred in four out of five fictitious relevant documents
(in addition to any real relevant documents they might have
been present in). Thus for user-entered terms the numerator
in equation 1 becomes (r + 4 + 0.5)/(R + 5 - r - 4 + 0.5)
[2].

Query expansion terms were selected from the term pool in
descending order of the selection value [11]
termweight × (r + 4)/(R + 5) for user-entered terms,
otherwise termweight × r/R, subject to not all documents
containing the term having been displayed, and the term not
being a semi-stopword* (unless it was entered by the user).
A maximum of 20 terms was used. These selected terms were
then used automatically in an expansion search, again with
the BM15 weighting function.

Footnote (relevance judgments): It was possible for
searchers to change their minds about the relevance of a
document. Subsequent feedback iterations handled this
correctly, but the DOCNO would be duplicated in the search
output. This appears to have led to some minor errors in
the frozen ranks evaluation in a few topics.
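
The selection procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: equation 1 is not reproduced in this section, so the sketch assumes the usual Robertson/Sparck Jones form with 0.5 corrections, where n is the number of documents containing the term and N the collection size; the class PoolTerm, its field names and the function names are hypothetical and are not taken from the Okapi system.

```python
import math
from dataclasses import dataclass


@dataclass
class PoolTerm:
    term: str
    n: int                        # documents in the collection containing the term
    r: int                        # judged-relevant documents containing the term
    user_entered: bool = False
    semi_stopword: bool = False
    all_postings_seen: bool = False  # every document containing the term already displayed


def term_weight(t, N, R):
    """w^(1)-style weight (assumed form of equation 1, with 0.5 corrections)."""
    if t.user_entered:
        # As if the term occurred in four of five fictitious relevant documents;
        # only the numerator of the weight is adjusted.
        rel_odds = (t.r + 4 + 0.5) / (R + 5 - t.r - 4 + 0.5)
    else:
        rel_odds = (t.r + 0.5) / (R - t.r + 0.5)
    nonrel_odds = (t.n - t.r + 0.5) / (N - t.n - R + t.r + 0.5)
    return math.log(rel_odds / nonrel_odds)


def selection_value(t, N, R):
    """termweight x (r + 4)/(R + 5) for user-entered terms, else termweight x r/R."""
    w = term_weight(t, N, R)
    if t.user_entered:
        return w * (t.r + 4) / (R + 5)
    return w * t.r / R


def select_expansion_terms(pool, N, R, max_terms=20):
    """Rank eligible pool terms by selection value and keep at most max_terms."""
    eligible = [
        t for t in pool
        if not t.all_postings_seen and (t.user_entered or not t.semi_stopword)
    ]
    ranked = sorted(eligible, key=lambda t: selection_value(t, N, R), reverse=True)
    return [t.term for t in ranked[:max_terms]]
```

The sketch simply truncates the ranked list at the maximum; whether the system also discarded terms with very low selection values at this stage is not stated here.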

Each  invocation  of  extract  used  all  the  available  rele- 
vance information,  and  there  was  no  "new  search"  com- 
mand. This  was  intended  to  encourage  compliance  with 
the  TREC  guidelines;  it  was  not  possible  for  a  dissat- 
isfied user  to  restart  a  search.  When  the  searcher  de- 
cided to  finish,  after  some  sequence  of  find,  show  and 
extract  commands,  the  results  command  invoked  a  final 
iteration  of  extract  (provided  there  had  been  at  least 
three  positive  relevance  judgments).  Finally,  the  top 
1000  DOCNOs  from  the  current  set  were  output  to  the 
results  file.  Apart  from  the  aforementioned  commands, 
users  could  do  info  sets  and  history. 

5.2    Searchers  and  search  procedure 

The  searches  were  done  by  a  panel  of  five  staff  and  re- 
search students  from  City  University's  Department  of 
Information  Science.  Search  procedure  was  not  rigidly 
prescribed,  although  some  guidelines  were  given.  There 
was  a  short  briefing  session  and  searchers  were  encour- 
aged to  experiment  with  the  system  before  starting. 
Procedures  seemed  to  be  considerably  influenced  by  in- 
dividual preferences  and  styles.  Some  searches  were 
done  collaboratively. 

Searchers tried to find relevant documents by any means
they liked within a single session. The number of
iterations of query expansion varied between zero and four,
with a mean of two. The IDs of all documents looked at were
output to the results file, together with the iteration
number. At the end of the session, if at least three
relevant documents had been found the system did a final
iteration of query expansion and output the top 1000 IDs;
if fewer than three, the top 1000 from the set which was
finally "current" were output.

*Semi-stopwords are words which, while they may be useful
search terms if entered by a user, are likely to be
detrimental if used in query expansion: numerals,
month-names, common adverbs etc.

There seemed to be an impression that the new topics
(topics3) are more difficult than the old. Results may also
have been affected by the huge stoplist which was being
used at that time because of a breakdown of the only disk
large enough to hold the very large scratch files generated
during inversion. Lack of the number "6" affected one
topic, days of the week another ("Black Monday"). The
searcher was urged to leave "Black Monday" to the end in
case we were able to reindex before the deadline, but she
decided to try it and thought it worked quite well.

An  edited  transcript  of  one  searcher's  notes  is  given 
below  as  Appendix  B. 

5.3  Results 

The official results of the manual run (Table 5) are
disappointing, with average precision 0.232 (60% of topics
below median), precision at 100 docs 0.4 and recall 0.59.
The final iteration was later re-run with BM11 instead of
BM15, and the results combined with the feedback documents
from the original searches for a frozen ranks evaluation
(see footnote). This did somewhat better on a majority of
the topics, but overall the manual results were very poor
compared to some of the automatic runs.

6    Other  experiments 

6.1    Query  modification  without 
relevance  information 

Some iterative automatic ad hoc runs were done in which
the top 10-50 documents obtained by the best existing
method were used (a) as a source of additional terms and
(b) as a source of "relevance" information for the w^(1)
weight calculation.

Expansion terms were selected as described in Section 4.2,
in descending order of r/R × w^(1). The maximum number of
additional terms was set at half the number of query terms.
For many of the topics most of the top terms extracted from
the feedback documents were in any case topic terms, so the
number of additional terms was small.

Example  (topic  112) 

Topic  112:  Funding  biotechnology 
30  feedback  documents  used 

In the table which follows, term sources are given either
as doc, in the case of expansion terms, or as a topic field,
where tit > con > nar > desc. In this example, final
weights involve a qtf component, and were obtained using
equation 6 with k3 = 8 (the resulting weight was multiplied
by k3 to obtain adequate granularity in an integer
representation). For expansion terms, qtf was taken as 1
and the same correction applied.

Footnote (frozen ranks evaluation): There were two topics
where the searcher found no relevant documents, so for
these topics the original results were inserted.


                                      Weights
Term             Src   qtf  # docs  Orig         Final
biotechnologi    tit   9    30      765    145   614
invest           con   4    29      148    80    213
fund             tit   2    23      78     55    88
capit            nar   2    21      78     51    81
pharmaceut       doc   (0)  15      -      73    64
ventur           nar   1    21      55     67    59
financi...       nar   2    17      64     36    57
startup...       nar   1    11      70     62    55
research         nar   1    26      35     61    54
financ           doc   (0)  15      -      54    48
partner          doc   (0)  17      -      55    48
drug             doc   (0)  18      -      53    47
investor         doc   (0)  19      -      52    46
provid           nar   3    14      66     21    45
firm             nar   1    22      36     50    44
technologi       doc   (0)  23      -      50    44
company...       doc   (0)  28      -      48    42
academ           nar   1    4       73     48    42
corpor           nar   2    9       76     26    41
monei            desc  1    18      37     43    38
stock            nar   1    20      33     43    38
industri...      doc   (0)  23      -      42    37
develop          doc   (0)  25      -      42    37
laboratori       nar   1    9       51     39    34
quantifi         nar   1    1       82     39    34
profit           nar   1    14      40     38    33
enterpr          nar   1    4       59     33    29
establish        nar   1    10      38     29    25
arena*           nar   2    0       148    15    24
data             nar   4    6       108    8     21
sale             nar   1    12      30     24    21
loss             nar   1    7       39     22    19
government...    nar   1    13      24     20    17
assist           nar   1    6       39     20    17
much             desc  1    11      28     20    17
answer           desc  1    2       52     16    14
follow           nar   1    7       26     9     8
rel*             desc  1    1       52     9     8
eg*              nar   1    0       67     8     7
question         desc  1    3       37     8     7
worldwid*        nar   2    0       126    4     6
division*        nar   1    2       41     6     5
figur*           nar   1    2       41     5     4

Here, nine of the 43 terms are not from the topic. The
starred terms were not used in the final search because
their selection value w^(1) × r/R is zero (to the nearest
integer). For this topic, the additional terms were
beneficial and reweighting alone rather neutral.

Footnote (term table): The terms followed by ellipses
represent synonym classes.

Terms  Wts    AveP   P5     P30    P100   RP     Rcl
All    Final  0.407  1.000  0.867  0.640  0.457  0.739
Topic  Final  0.362  1.000  0.733  0.620  0.440  0.698
Topic  Orig   0.373  0.600  0.800  0.700  0.433  0.680
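
The arithmetic behind the topic 112 example can be checked with a short Python sketch. Equation 6 is not reproduced in this section, so the form used here, weight × qtf/(k3 + qtf) followed by multiplication by k3 and truncation to an integer, is an assumption inferred from the table; it does reproduce the Final column from the preceding weight column for every row shown. The function names are hypothetical, and R = 30 is the number of feedback documents used for this topic.

```python
def final_weight(weight, qtf, k3=8):
    """Assumed form of equation 6, rescaled by k3 and truncated to an integer."""
    return (weight * qtf * k3) // (k3 + qtf)


def selection_value(weight, r, R=30):
    """Selection value weight x r/R, rounded to the nearest integer."""
    return round(weight * r / R)


# A few rows of the table: (term, qtf used, weight before qtf, # docs, Final)
rows = [
    ("biotechnologi", 9, 145, 30, 614),
    ("invest",        4,  80, 29, 213),
    ("pharmaceut",    1,  73, 15,  64),  # expansion term: qtf taken as 1
    ("provid",        3,  21, 14,  45),
    ("rel",           1,   9,  1,   8),  # starred in the table
]

for term, qtf, weight, r, final in rows:
    print(term, final_weight(weight, qtf), selection_value(weight, r))
```

Under these assumptions the starred term "rel" gets a selection value of 9 × 1/30, which rounds to zero, while "question" (weight 8, three feedback documents) rounds to one and is kept.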


6.2  Stemming 

A comparison was made on the AP database between the
normal Okapi stemming which removes many suffixes and a
"weak" stemming procedure which only conflates singular and
plural forms and removes "ing" endings. For some weighting
functions weak stemming increased precision by about 2% and
decreased recall by about 1%, but the observed difference
is unlikely to be significant.

6.3  Stoplists 

Some runs were done on the AP database to investigate the
effect of stoplist size. A small stoplist consisted of the
17 words

a,  the,  an,  at,  by,  into,  on,  for,  from,  to, 
with,  of,  and,  or,  in,  not,  et 

and  a  large  one  contained  209  articles,  conjunctions, 
prepositions,  pronouns  and  verbs. 

There  was  no  significant  difference  in  the  results  of  the 
runs,  but  the  index  size  was  about  25%  greater  with  the 
small  stoplist. 

7    Conclusions  and  prospects 

7.1    The  new  probabilistic  models 

The most significant result is perhaps the great
improvement in the automatic results brought about by the
new term weighting models. In the ad-hoc runs, with no qtf
component, BM15 is 14% better than BM1 on average precision
and about 9% better on high precision and recall. The
corresponding figures for BM11 are 51% and 34% (Table 3).
For the routing runs, where a considerable amount of
relevance information had contributed to the term weights,
the improvement is less, but still very significant
(Table 4). For the manual feedback searches (Table 5) there
was a small improvement when they were re-run with BM11
replacing BM15 in the final iteration.
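
The average precision percentages quoted above follow directly from the "singles" rows of Table 3. A minimal Python check (the dictionary and function names are illustrative only):

```python
# Average precision for single-term runs, copied from Table 3.
ave_prec = {"BM1": 0.199, "BM15": 0.227, "BM11": 0.300}


def relative_gain(new, base):
    """Percentage improvement of new over base."""
    return 100.0 * (new - base) / base


for name in ("BM15", "BM11"):
    gain = relative_gain(ave_prec[name], ave_prec["BM1"])
    print(f"{name} vs BM1: {gain:.1f}%")
```

Rounded to whole percentages, these give the 14% and 51% figures in the text.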

The drawback of these two models is that the theory says
nothing about the estimation of the constants, or rather
parameters, k1 and k2. It may be assumed that these depend
on the database, and probably also on the nature of the
queries and on the amount of relevance information
available. We do not know how sensitive they are to any of
these factors. Estimation cannot be done without sets of
queries and relevance judgments, and even then, since the
models are not linear, they do not lend themselves to
estimation by logistic regression. The values we used were
arrived at by long sequences of trials mainly using topics
51-100 on the disks 1 and 2 database, with the TREC-1
relevance sets.


Discussion 

The  main  motive  for  experimenting  with  this  type  of 
query  expansion  is  that  it  is  one  way  of  finding  terms 
which  are  in  some  sense  closely  associated  with  the 
query  as  a  whole.  It  does  not  fit  particularly  well  with 
the  Robertson/Sparck  Jones  type  of  probabilistic  theory 
[5],  the  validity  of  which  depends  on  pairwise  indepen- 
dence of  terms  in  both  relevant  and  nonrelevant  docu- 
ments. However,  it  is  clear,  if  only  from  the  results  in 
this  paper,  that  mutual  dependence  does  not  necessarily 
lead  to  poor  results. 

There are many variables involved. In our rather limited
experiments most of the initial feedback searches were done
under the conditions of the first row of Table 2, that is
with terms from title, concepts, narrative and description
(there were a few runs using title and concepts only, but
the results for most topics were not good); and weighting
function BM11 with termweights given by equation 6 with
large k3 (1000). This gave nearly the best precision at 5
and 30 documents of any of our results. The number of
feedback documents was constant across topics and was
varied between 10 and 50. For the final search, terms were
always weighted with BM11, but several values of k3 were
tried (including zero). Some runs used topic terms only and
some used expansion terms as well. There was one run
omitting narrative and description terms from the final
search, but it was not among the very best and is not
reported in the table. The number of terms in the final
search was varied from 10 upwards, terms being selected as
usual in descending order of termweight × r/R. Some
evaluations were done using frozen ranks, in case the
initial searches tended to give better low precision, but
this turned out not to be the case.

A few of the results are summarised in Table 6. They
include results which appear better than the best otherwise
obtained, but the difference is small, and these runs have
not yet been repeated on the other topic sets. A qtf weight
component is still needed (compare rows 2 and 14 of the
table). The number of feedback documents is not critical.
Speeding searching by using only the top 10 or 20 terms is
detrimental.

It  is  interesting  that  results  do  not  seem  to  be  very 
greatly  affected  by  the  precision  of  the  feedback  set. 
Looking  at  the  individual  topics  in  the  run  represented 
by  the  top  row  of  Table  6,  25  did  better  than  in  the 
feedback  run,  18  did  worse  and  the  remainder  about  the 
same.  Restricting  to  the  20  topics  where  the  precision  at 
30  in  the  feedback  set  was  below  0.5,  the  corresponding 
figures  are  7,  10  and  3. 

Taking advantage of the very full topic statements to
derive query term frequency weights gives another
substantial improvement in the automatic ad-hoc results.
Comparing the top row of Table 2 with the top row of
Table 1, there is a 20% increase in average precision. The
"noise" effect of the narrative and description fields is
far more than outweighed by the information they give about
the relative importance of terms (compare the "TCND" row of
Table 1 with the top row of Table 2).
It remains to be discovered how well these new models
perform in searching other types of database. Term
frequency and document length components may not be very
useful in searching brief records with controlled indexing,
but one would expect these models to do well on abstracts.
It is also rare to have query statements which are as full
as the TIPSTER ones, so there are many situations in which
a qtf component would have little or no effect.

7.2  Routing 

Our results here (Table 4) were relatively good, and
further improved when re-run with BM11. However, the TREC
routing scenario is perhaps not particularly realistic,
given the large amount of relevance information, which we
made full use of as the sole source of query terms. In
addition, the best of our runs depended on a long series of
retrospective trials in which the number of query terms was
varied. In a real-world situation one would have to cope
with the early stages when there would be few documents and
little relevance information (initially none at all). It
would be necessary to develop a term selection and
weighting procedure which was capable of progressing
smoothly from a minimum of prior information up to a
TREC-type situation. It may be possible to come up with a
decision procedure for term selection using something
similar to the selection value w^(1) × r/R. Perhaps a
future TREC could include some more restrictive routing
emulations.

7.3  Interactive  ad-hoc  searching 

The  result  of  this  trial  was  disappointing  except  on  pre- 
cision at  100  documents  (Table  5),  scarcely  better  than 
the  official  automatic  ad-hoc  run.  On  three  topics  it 
gave  the  best  result  of  any  of  our  runs,  and  two  more 
were  good,  but  the  remaining  45  ranged  from  poor  to 
abysmal.  Little  analysis  has  yet  been  done.  For  some 
topics  it  is  clear  that  the  search  never  got  off  the  ground 
because  the  searcher  was  unable  to  find  enough  relevant 
documents  to  provide  reliable  feedback  information,  but 
the  mean  number  found  per  topic  was  ten,  which  should 
have  been  enough  to  give  reasonable  results  (cf  Table 
6,  where  ten  feedback  documents  performs  quite  well). 
Currently, there are discussions towards a more realistic
set of rules for interactive searching for TREC-3, and we
hope to develop a better procedure and interface.

7.4  Prospects 
Paragraphs 

When  searching  full  text  collections  one  often  does  not 
want  to  search,  or  even  necessarily  to  retrieve,  complete 
documents.  Our  new  probabilistic  models  do  not  apply 
to  documents  where  the  verbosity  hypothesis  does  not 
apply  (Section  2.3).  Some  of  the  TREC-2  participants 
searched  "paragraphs"  rather  than  documents,  and  this 
is  clearly  right,  provided  a  sensible  division  procedure 
can  be  achieved.  We  made  some  progress  towards  de- 
veloping a  "paragraph"  database  model  for  the  Okapi 
system,  but  there  has  not  been  time  to  implement  it. 
Further  work  then  needs  to  be  done  on  methods  of  deriv- 
ing the  retrieval  value  of  a  document  from  the  retrieval 
value  of  its  constituent  paragraphs. 

Parameter  estimation 

Work  is  in  progress  on  methods  of  using  logistic  regres- 
sion or  similar  techniques  to  estimate  the  parameters 
for  the  new  models. 

Derivation  and  use  of  phrases  and  term 
proximity 

A  few  results  are  reported  in  Table  3.  They  are  not 
particularly  encouraging.  There  is  probably  scope  for 
further  experiments  in  this  area,  not  only  on  tuples  of 
adjacent  words  but  also  on  Keen-type  [9]  weighting  of 
query  term  clusters  in  retrieved  documents. 

A    2-Poisson model with document length component

Basic  ideas 

The basic weighting function used is that developed in [8],
and may be expressed as follows:

\[
w(\mathbf{x}) = \log \frac{P(\mathbf{x} \mid R)\, P(\mathbf{0} \mid \bar{R})}
                          {P(\mathbf{x} \mid \bar{R})\, P(\mathbf{0} \mid R)}
\tag{8}
\]

where

$\mathbf{x}$ is a vector of information about the document;
$\mathbf{0}$ is a reference vector representing a zero-weighted
document;
$R$ and $\bar{R}$ are relevance and non-relevance respectively.

For  example,  each  component  of  x  may  represent  the  pres- 
ence/absence of  a  query  term  in  the  document  (or,  as  in  the 
case  of  formula  2  in  the  main  text,  its  document  frequency); 
0  would  then  be  the  "natural"  zero  vector  representing  all 
query  terms  absent.  In  this  formulation,  independence  as- 
sumptions lead  to  the  decomposition  of  w  into  additive  com- 
ponents such  as  individual  term  weights. 

A  document  length  may  be  added  as  a  component  of  x; 
however,  document  length  does  not  so  obviously  have  a  "nat- 
ural" zero  (an  actual  document  of  zero  length  is  a  patholog- 
ical case).  Instead,  we  may  use  the  average  length  of  a  doc- 
ument for  reference;  thus  we  would  expect  to  get  a  formula 
in  which  the  document  length  component  disappears  for  a 
document  of  average  length,  but  not  for  other  lengths. 


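Suppose, then, that the average length of a document is
A. The weighting formula becomes:

\[
w(\mathbf{x}, d) = \log \frac{P((\mathbf{x}, d) \mid R)\, P((\mathbf{0}, A) \mid \bar{R})}
                             {P((\mathbf{x}, d) \mid \bar{R})\, P((\mathbf{0}, A) \mid R)}
\]

where d is document length, and x represents all other
information about the document. This may be decomposed as
follows:

\[
w(\mathbf{x}, d) = w(\mathbf{x}, d)_1 + w(\mathbf{x}, d)_2 \tag{9}
\]

where

\[
w(\mathbf{x}, d)_1 = \log \frac{P(\mathbf{x} \mid (R, d))\, P(\mathbf{0} \mid (\bar{R}, d))}
                               {P(\mathbf{x} \mid (\bar{R}, d))\, P(\mathbf{0} \mid (R, d))}
\]

and

\[
w(\mathbf{x}, d)_2 = \log \frac{P((\mathbf{0}, d) \mid R)\, P((\mathbf{0}, A) \mid \bar{R})}
                               {P((\mathbf{0}, d) \mid \bar{R})\, P((\mathbf{0}, A) \mid R)}
\]

These two components are discussed further below.

Hypotheses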

As  indicated  in  the  main  text,  one  may  imagine  different 
reasons  why  documents  should  vary  in  length.  The  two  hy- 
potheses given  there  ("scope"  and  "verbosity"  hypotheses) 
may  be  regarded  as  opposite  poles  of  explanation.  The  ar- 
guments below  are  based  on  the  Verbosity  hypothesis  only. 

The Verbosity hypothesis would imply that document
properties such as relevance and eliteness can be regarded
as independent of document length; given eliteness for a
term, however, the number of occurrences of that term would
depend on document length. In particular, if we assume that
the two Poisson parameters for a given term, λ and μ, are
appropriate for documents of average length, then the
number of occurrences of the term in documents of length d
will be 2-Poisson with means λd/A and μd/A.

Second  component 

The second component of equation 9 is

\[
w(\mathbf{x}, d)_2 =
  \log \frac{P(\mathbf{0} \mid (R, d))\, P(\mathbf{0} \mid (\bar{R}, A))}
            {P(\mathbf{0} \mid (\bar{R}, d))\, P(\mathbf{0} \mid (R, A))}
  + \log \frac{P(d \mid R)\, P(A \mid \bar{R})}
              {P(d \mid \bar{R})\, P(A \mid R)}
\]

Under the Verbosity hypothesis, the second part of this
formula is zero. Making the usual term-independence
assumptions, the first part may be decomposed into a sum of
components for each query term, thus:

\[
w(t, d)_2 = \log
  \frac{\left(p' e^{-\lambda d/A} + (1 - p')\, e^{-\mu d/A}\right)
        \left(q' e^{-\lambda} + (1 - q')\, e^{-\mu}\right)}
       {\left(q' e^{-\lambda d/A} + (1 - q')\, e^{-\mu d/A}\right)
        \left(p' e^{-\lambda} + (1 - p')\, e^{-\mu}\right)}
\tag{10}
\]

where t is a query term and p', q', λ and μ are as in
formula 2. Note that there is a component for each query
term, whether or not the term is in the document.

For almost all normal query terms (i.e. for any terms that
are not actually detrimental to the query), we can assume
that p' > q' and λ > μ. In this case, formula 10 can be
shown to be monotonic decreasing with d, from a maximum as
d → 0, through zero when d = A, and to a minimum as d → ∞.
As indicated, there is one such factor for each of the nq
query terms.

Once  again,  we  can  devise  a  very  much  simpler  function 
which  approximates  to  this  behaviour;  this  is  the  justifica- 
tion for  formula  5  in  the  main  text. 
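
The behaviour claimed for formula 10 is easy to confirm numerically. The Python sketch below evaluates the per-term component for a range of document lengths; the parameter values (p' = 0.6, q' = 0.1, λ = 3, μ = 0.5, A = 100) are arbitrary illustrative choices satisfying p' > q' and λ > μ, not values used in the experiments, and the function names are hypothetical.

```python
import math


def p_absent(p_elite, lam, mu, scale):
    """Probability of zero occurrences under a 2-Poisson mixture
    whose two means are lam * scale and mu * scale."""
    return p_elite * math.exp(-lam * scale) + (1.0 - p_elite) * math.exp(-mu * scale)


def length_component(d, A, p, q, lam, mu):
    """Per-term value of formula 10 for a document of length d (average length A)."""
    scale = d / A
    num = p_absent(p, lam, mu, scale) * p_absent(q, lam, mu, 1.0)
    den = p_absent(q, lam, mu, scale) * p_absent(p, lam, mu, 1.0)
    return math.log(num / den)


if __name__ == "__main__":
    for d in (1, 50, 100, 200, 1000):
        value = length_component(d, A=100, p=0.6, q=0.1, lam=3.0, mu=0.5)
        print(f"d = {d:5d}  w(t, d)_2 = {value:+.4f}")
```

With these values the component is positive for short documents, zero at d = A and negative for long ones, decreasing throughout.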
First component

Expanding the first component of 9 on the basis of term
independence assumptions, and also making the assumption
that eliteness is independent of document length (on the
basis of the Verbosity hypothesis), we can obtain a formula
for the weight of a term t which occurs tf times. This formula
is similar to equation 2 in the main text, except that λ and
μ are replaced by λd/Δ and μd/Δ. The factors d/Δ in
components such as λ^tf cancel out, leaving only the factors
of the form e^(-λd/Δ).

Analysis of the behaviour of this function with varying tf
and d is a little complex. The simple function used for the
experiments (formula 4) exhibits some of the correct properties,
but not all. In particular, the maximum value obtained
as d → 0 should be strongly dependent on tf; formula 4 does
not have this property.

B  Extracts from a searcher's notes

Choice  of  search  terms 

Suitable words and phrases occurring in title, description,
narrative, concept and definition fields were underlined;
often this provided more than enough material to begin
with. Sometimes they were supplemented by extra words,
e.g. for a query on international terrorism I added "negotiate",
"hostage", "hijack", "sabotage", "violence", "propaganda",
as well as the names of known terrorist groups likely
to fit the US bias of the exercise.

I did not look at reference books or other on-line
databases, and tended to avoid very specific terms like
proper names from the query descriptions, as I found they
could lead the search astray. For instance, the 1986 Immigration
Law was also known as the Simpson-Mazzoli Act,
but the name Mazzoli also turned up in accounts of other
pieces of legislation, so it was better to use a combination of
"real" words about this topic.

In some queries, it was necessary to translate an abstract
concept, e.g. "actual or alleged private sector economic
consequences of international terrorism" into words
which might actually occur in documents, e.g. "damage",
"insurance claims", "bankruptcy", etc. For this purpose
the use of a general (rather than domain-specific) thesaurus
might be a useful adjunct to the system.

Like the other participants I was surprised at the contents
of the stop-word list, e.g. "talks", "recent", "people", "new",
but not "these"! However it was usually possible to find
synonyms for stop-words and their absence was not seriously
detrimental to any query.

Grouping  of  terms,  use  of  operators 

Given the complexity of the queries, it was obviously necessary
to build them up from smaller units. My original
intention was to identify individual facets and create sets of
single words representing each, then put them together to
form the whole query. [...] For example, for a query about
the prevention of nuclear proliferation I had a set of "nuclear"
words (reprocessing, plutonium, etc.), a set of "control"
words (control, monitor, safeguards, etc.) and sets of
words for countries (argentina, brazil, iraq, etc.) suspected
of violating international regulations on this point. This
proved a bad strategy: the large sets (whether ORed or
BMed* together) had low weightings because of their collectively
high frequencies, and the final query was very diffuse.

A more successful approach was to build several
small, high-weighted sets using phrases with OP=ADJ or
OP=SAMES[entence] (e.g. economic trends, gross national
product, standard of living, growth rate, productivity gains),
and then to BM them together, perhaps with a few extra
singletons (e.g. decline, slump, recession). Because of the
TREC guidelines, I didn't look at any documents for the
small sets as I went along, although under normal circumstances
I would have done so.

Our initial instructions were to use default best-matching
if at all possible, rather than explicit operators. As already
suggested, ADJ and SAMES were an absolute necessity
given the length of documents to be searched, but AND
and OR were generally avoided; on the occasions when I
tried AND (out of desperation) it was not particularly useful.
For one query where I thought it might be necessary
(to restrict a search to documents about the US economy)
it luckily proved superfluous because of the biased nature of
the database; indeed it would have made the results worse, as
the US context of these documents was implied rather than
stated.


Viewing  results,  relevance  feedback 

Normally I looked at about the top 5-10 records from the
first full query. If 40% or more seemed relevant, the query
was considered to be fairly satisfactory and I went on down
the list trying to accumulate a dozen or so records for the
extraction phase. As ... noted by other participants, there was
a conflict between judging a record relevant because it fitted
the query, and because it was likely to yield useful new terms
for the next phase. On the one hand were the "newsbyte"
type of documents containing one clearly relevant paragraph
amidst a great deal of potential noise, and on the other the
documents which were in the right area, contained all the
right words, but failed the more abstract exclusion conditions
of the query. I tried to judge on query relevance, but
erred on the side of permissiveness for documents containing
the right sort of terms.

The competition conditions discouraged a really thorough
exploration of possibilities when a query was not initially
successful. In one very bad case, having seen more than 20
irrelevant records and knowing that they would appear at
the head of my output list, I felt that the query would show
up badly in the [results] anyway and that it was not worth
exploring further, as I might had there been a real question
to answer.


* BM = "best match"; the default weighted set combination
operation was BM15 (see Section 2.6)
Extracting new terms

I tried to get at least six relevant documents for the extraction
phase, and usually managed a few more. As already
noted, sets generated by term extraction contain only single
words, so before looking at the new records I sometimes
added in a few phrases to this set, either important ones from
the original query or others which had occurred in relevant
documents. The extracted sets of terms tended to be larger
than the original query and certainly included items which
a human searcher (at least one unfamiliar with this genre of
literature) would not have thought of. It was amusing, for
instance, to see "topdrawer" and "topnotch" (epithets for
companies) extracted from documents about investment in
biotechnology, and "leftist" (an invariable collocate for Sandinista)
pulled out of documents about Nicaraguan peace
talks. Some material for socio-linguistic analysis here!

My impression ... is that where the original document set
from which terms were extracted was fairly coherent, the derived
set [from query expansion] also had a high proportion
of relevant documents. Not surprisingly, where I had scraped
the barrel and tried several different routes to a few relevant
documents, extraction produced equally miscellaneous and
disappointing results.

Normally I went through two or three cycles of
selection/extraction, but looking at fewer records each time. The
set of extracted terms did not seem to change materially
from one cycle to the next, and I would have expected the
final result file to reflect the query quite well even though the
phrases had been lost.

Conclusion 

In spite of the frustrations of this exercise, I found it a more
interesting retrieval task than normal bibliographic searching,
mainly because it was possible to see the full documents
to gauge the success of the query, and use a broader range of
natural-language skills to dream up potentially useful search
terms.
Table 5: Manual searches with feedback

Run              AveP   P5     P30    P100   RP     Rcl    % of topics where AveP > median
Official (BM15)  0.232  0.492  0.468  0.400  0.297  0.591  40
Re-run (BM11)    0.247  0.480  0.477  0.411  0.315  0.607  48

Database: disks 1 and 2. Topics: 101-150

Table 6: Some results from query modification

#fbk docs  Term source  k3     Terms   AveP   P5     P30    P100   RP     Rcl    % of topics where AveP > median
50         TCNDdoc      8      all     0.369  0.660  0.591  0.487  0.408  0.754  92
30         TCNDdoc      8      all     0.368  0.668  0.585  0.486  0.407  0.749  88
10         TCNDdoc      8      all     0.363  0.668  0.584  0.485  0.400  0.748  88
50         TCNDdoc      9      all     0.360  0.624  0.573  0.482  0.399  0.748  88
50         TCNDdoc      9      top 40  0.354  0.620  0.573  0.478  0.394  0.741  86
50         TCNDdoc      9      top 30  0.353  0.632  0.567  0.480  0.395  0.742  88
30         TCNDdoc      8      top 30  0.360  0.676  0.577  0.479  0.402  0.741  88
30         TCNDdoc      8      top 20  0.348  0.636  0.571  0.474  0.392  0.734  82
30         TCNDdoc      8      top 10  0.318  0.604  0.537  0.449  0.366  0.702  78
50         TCND         8      all     0.364  0.636  0.573  0.487  0.406  0.755  92
30         TCND         8      all     0.362  0.644  0.573  0.484  0.408  0.749  90
10         TCND         8      all     0.363  0.640  0.574  0.481  0.406  0.754  88
30         TCND         0      all     0.334  0.652  0.559  0.458  0.374  0.711  80
30         TCNDdoc      0      all     0.310  0.645  0.546  0.448  0.359  0.675  66

Initial feedback run for comparison (top row of Table 2)
None       TCND         large  all     0.360  0.652  0.569  0.479  0.401  0.754  92

Retrospective run using all known relevant documents to reweight the topic terms
Variable   TCND         0      all     0.371  0.708  0.600  0.497  0.408  0.758  92

Database: disks 1 and 2. Topics: 101-150
Abstract

This study investigated the effect on retrieval performance
of two methods of combination of multiple representations
of TREC topics. Five separate Boolean queries for each of
the 50 TREC routing topics and 25 of the TREC ad hoc
topics were generated by 75 experienced online searchers.
Using the INQUERY retrieval system, these queries were
both combined into single queries, and used to produce five
separate retrieval results, for each topic. In the former case,
results indicate that progressive combination of queries
leads to progressively improving retrieval performance,
significantly better than that of single queries, and at least as
good as the best individual single query formulations. In
the latter case, data fusion of the ranked lists also led to
performance better than that of any single list.

1.  Introduction 

The general goal of our project in the TREC-2 program
was to investigate the effect of making use of several different
formulations of a single information problem, on information
retrieval (IR) system performance. The basis for
this work lies in both theory and empirical evidence. From
the empirical point of view, it has been noted for some
time, that different representations of the same information
problem retrieve sets (or ranked lists) of documents which
contain different relevant, as well as non-relevant documents
(see, e.g. McGill, Koll & Noreault, 1979; Saracevic &
Kantor, 1988). There is some implication from this evidence
(made explicit by Saracevic and Kantor, 1988), that
taking account of the different results of the different
formulations, could lead to retrieval performance that is better
than that of any of the individual query formulations. From
the theoretical point of view, IR can be considered as a
problem of inference (see, e.g. van Rijsbergen, 1986). That
is, IR is concerned with estimating, given available evidence
about such things as information problems and documents
(or in general, retrievable information objects), the
likelihood (or probability, or degree) of relevance of a document
to the information problem. From this point of
view, different query formulations constitute different
sources of evidence which could be used to infer the probable
relevance of a document to an information problem, and
it is thus reasonable to consider ways in which to use (i.e.
combine) these sources of evidence in the inference process.

These ideas are general to any source of evidence which
might be used for IR, such as the evidence of different retrieval
techniques, or different document representation
techniques, or, in general, different IR systems. One aspect
of our project uses the example of different query formulations
as a simulation of the general problem of combination
of evidence from different systems.

An additional argument is available for the special case
of different query representations. That is, if we consider an
information problem to be a complex, and in general
difficult-to-specify entity (see, e.g. Taylor, 1968; Belkin, Oddy
& Brooks, 1982), then we might conclude that each different
representation, derived from some statement by the user,
is a different interpretation of the user's underlying information
problem, highly unlikely to be like anyone else's (or
any other system's) interpretation. Given the empirical evidence,
whether any one such interpretation is 'better' than
another seems moot. However, we might say that each captures
some different, yet pertinent aspect of the user's underlying
problem; or, that those aspects of the different interpretations
which are common to them all (or more than
one) reflect some 'core' aspect of the problem. Although
techniques for making use of the different interpretations
might vary according to which of these two views one
takes, the general position suggests that it will always be a
good idea to take advantage of as many such interpretations
as possible. For this case, we therefore consider the issue
of combination of different query representations within the
'same' IR system.

Our project, thus, considers the problem of inference in
IR at two levels of analysis. The first level, as introduced
by Turtle & Croft (1991), asks about the effect of evidence
obtained when two or more formal query statements are
produced for the same information problem. The second
level, which is simulated in this study, asks about combination
of evidence provided by two or more distinct systems,
ranking the same set of documents in response to the
same problem. To distinguish these two levels, and in
keeping with earlier discussions of the issues involved, we
henceforth refer to the combination of query statements as
"query combination", and we refer to the combination of evidence
from differing systems as "data fusion". Others have
also addressed various aspects of this general question.
Apart from those already cited, we mention in particular the
work of Fox and his colleagues (Fox et al., 1993; Fox and
Shaw, this volume), and that of Belkin, et al. (1993).
These studies in fact address precisely the question of query
combination, the Belkin et al. work being a direct precursor
to this, and the Fox et al. studies using different query
formulation, combination and retrieval techniques, but with
very similar results.

Why ought either of these two methods work in the IR
situation? The central idea is that either the specific internal
score, assigned to a document for a query, or the rank of
a document in the list produced for a query, represents
information about the relevance of the document to the query.
For Boolean retrieval, we may address this question with
concepts of signal detection. In this framework, there are
two conditional probabilities. The probability that a relevant
document is retrieved by system S is d_S. The probability
that a not relevant document is retrieved is f_S. If two
systems (or formulations) are independent, the posterior
relevance odds are increased by the product d_1 d_2 / (f_1 f_2). In
actual application (Saracevic and Kantor, 1988), improvements
are not this large, suggesting either the existence of
an effective base of not-relevant documents, or some effect
of interdependence. It can be shown that if several query
formulations are drawn from a normal distribution centered
at the optimal query formulation, then some fraction of the
time, the simple average of these formulations will be
closer to the optimum than even the best of them. An even
larger fraction of the time, there will be an optimum linear
combination which is more nearly optimal than any of the
cases from which it is formed (Kantor, 1993).
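
To make the signal detection argument concrete, the following minimal sketch multiplies prior relevance odds by the ratio d/f of each system that retrieved a document, under the independence assumption stated above. The function names, the prior and the detection and false-retrieval rates are all hypothetical values chosen for illustration; they are not estimates from our data.

```python
def posterior_odds(prior_odds, systems):
    """Update relevance odds for a document retrieved by every system.

    `systems` is a list of (d, f) pairs: d is the probability that a
    relevant document is retrieved, f the probability that a not
    relevant document is retrieved. Systems are assumed independent.
    """
    odds = prior_odds
    for detect, false_retrieval in systems:
        odds *= detect / false_retrieval
    return odds


def odds_to_probability(odds):
    """Convert odds in favour of relevance to a probability."""
    return odds / (1.0 + odds)


# Hypothetical numbers: 1% of the collection relevant, two formulations.
prior = 0.01 / 0.99
combined = posterior_odds(prior, [(0.5, 0.05), (0.4, 0.08)])
print(odds_to_probability(prior), odds_to_probability(combined))
```

Because the two ratios multiply, a document retrieved by both formulations receives a much larger increase in odds than either formulation gives alone, which is the idealised gain that interdependence reduces in practice.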

The  existence  of  such  models  explains  why  we  might 
expect  combination  of  evidence,  or  data  fusion,  to  work  for 
the  case  of  several  query  formulations,  as,  for  instance,  in 
the  INQUERY  retrieval  system  (Turtle  &  Croft,  1991). 
But  these  models  do  not  predict  that  these  techniques  must 
work.  The  investigation  of  whether  they  do  work,  is  the 
subject  of  this  paper. 

Specifically, we investigate whether data fusion methods
will produce better performance than any single method;
and,  whether  combination  of  query  formulations  does  better 
than  the  best  individual  query  formulations,  and  whether 
progressive  combination  of  query  formulations  leads  to 
progressively  better  IR  performance.  For  each  of  these 
questions,  we  also  address  the  issue  of  what  methods  to  use 
in  the  combination  of  evidence. 

In  this  paper,  we  do  not  discuss  the  "official"  results 
which  we  submitted  to  TREC-2,  except  in  passing.  The 
reason  for  this  is  that  we  are  not  so  much  interested  in  the 
absolute  performance  of  the  techniques  which  we  use,  as  in 
their  performance  relative  to  one  another.  What  we  are 
most concerned with is what happens to retrieval performance
as we combine evidence; if we find that combining
evidence  in  specific  ways  leads  to  improvements  over  our 
starting  point  of  non-combination,  then  we  can  begin  to 
investigate  how  to  optimize  starting  points,  as  well  as  rules 
for  combination. 

The general plan of our study was as follows. We collected,
from experienced online searchers, five different
query formulations for each of the 50 routing topics and for
25 of the ad hoc topics. These query formulations were
then put to the INQUERY retrieval system (made available
to us by the University of Massachusetts), both as single
queries, and as combinations of queries for each topic. The
combinations were studied at various levels, with the five-fold
combination for each set being reported as "official"
TREC-2 results for query combination. The five retrieved
lists for the ad hoc topics were merged, and reported as
"official" TREC-2 results for data fusion.

2.  Methods 


2.1  Query  Formulation  Procedures 

The query formulations used in this study were generated
by volunteer online searchers, all of whom were experienced
users of large bibliographic retrieval systems. In
order to obtain the multiple query representations, we asked
five different searchers to generate Boolean search statements
for each of the TREC topics in our analysis. We
asked each of our volunteer searchers to generate a query
formulation for five different topics, resulting in five
independently generated query formulations for each topic. After
formulating each query, searchers were asked to answer
four questions about the process: how long it took to formulate
the query; how related the topic was to their normal
searches; how easy it was for them to formulate the query;
and, the extent to which they had enough information to
construct the query. A total of 75 searchers participated in
our study; 50 for the routing topics, and 25 for the ad hoc
topics. In addition to the questionnaire items mentioned
above, the ad hoc searchers were also asked how many years
of online searching experience they had. Searchers for the
routing queries were not asked this question. See the Appendix
for a sample response sheet.

Our  study  is  based  on  analysis  of  the  entire  set  of  50 
routing  topics,  and  a  selected  sample  of  25  ad  hoc  topics. 
The  sample  was  stratified  according  to  the  domain  of  the 
topic,  in  an  effort  to  represent  the  distribution  of  domains 
in  the  entire  set  of  ad  hoc  topics. 

In our experiments, we used the INQUERY retrieval engine
(version 1.5), developed at the University of Massachusetts
(Turtle & Croft, 1991). INQUERY is a probabilistic
inference network-based system, which is based
upon the idea of combining multiple sources of evidence in
order to plausibly infer the relevance of a document to a
query. The underlying formalism is that of a Bayesian
probabilistic inference network (Pearl, 1988), which provides
strict rules for how to combine sources of evidence.
Turtle and Croft (1991) give a detailed description of the
model and its implementation; a more general description
is available in Belkin and Croft (1992). Here, we note a
few characteristics of the system which are germane to the
project at hand.

First, INQUERY provides a natural means for combination
of multiple query formulations, as a function of its
design. Second, it incorporates a large set of operators
which allow, in addition to sophisticated natural language
query formulations, complex Boolean formulations. The
Boolean operators in INQUERY are not strict, however,
which allows ranking of output, and also leads to significantly
better performance than strict Boolean retrieval
(Turtle and Croft, 1991). See the paper by Croft in this
volume for more detail on INQUERY.

2.2  Query  Combination  Experiments 

Each of the Boolean query formulations produced by
our searchers was translated into INQUERY syntax. Two
methods of query combination were then used in our study,
each specific to the TREC-2 tasks of responding to ad hoc
and routing topics. The first, which we label "comb1", was
applied to the ad hoc topics. In this procedure, we simply
combine the five query formulations for each topic directly,
into one query, using the INQUERY "unweighted sum" operator.
This query is then used as the search statement in
our experiments. In the ad hoc search environment, we
cannot expect to have relevance judgments, and so we can
do no more than simple combination.

The second combinatorial procedure, called "combx",
was used for the routing topics. Here, we did a separate
search for each separate query formulation for all 50 topics,
on the training set supplied from the TREC-1 data. From
these results, we used the average 11-point precision (in the
"official" results reported at TREC-2; precision at 100 documents
for the "unofficial results" reported in this paper) of
each query formulation as a weight for that formulation in
the combination of all five formulations for each topic.
For this, we used INQUERY's "weighted sum" operator.
This procedure corresponds to constructing a simple combined
query, learning something about how that query's
components perform on the current database, and taking account
of that evidence to modify the query formulation for
searching the next database.
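
The difference between the two procedures can be sketched as follows. This is an illustration only, under the assumption that the "unweighted sum" and "weighted sum" operators behave like a plain and a weighted average of the per-formulation beliefs for a document; it does not reproduce INQUERY's implementation, and the function names, belief scores and training precisions are hypothetical.

```python
def unweighted_sum(beliefs):
    """Combine per-formulation beliefs with equal weight (as in the first procedure)."""
    return sum(beliefs) / len(beliefs)


def weighted_sum(beliefs, weights):
    """Combine beliefs using per-formulation weights (as in the second procedure).

    In this sketch the weights stand for the precision each formulation
    achieved on training data.
    """
    total = sum(weights)
    if total == 0:
        # No formulation retrieved anything useful in training:
        # fall back to equal weighting (an assumption of this sketch).
        return unweighted_sum(beliefs)
    return sum(w * b for w, b in zip(weights, beliefs)) / total


# Hypothetical beliefs of five formulations for one document,
# and hypothetical training precisions for the same formulations.
beliefs = [0.40, 0.55, 0.62, 0.48, 0.70]
training_precision = [0.10, 0.30, 0.25, 0.05, 0.30]

print(unweighted_sum(beliefs))
print(weighted_sum(beliefs, training_precision))
```

In the weighted case, formulations that performed well on the training database contribute more to the combined score, which is the sense in which the query is modified before the next database is searched.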

These methods of combining queries give us a very
straightforward way to test our hypotheses about the effectiveness
of multiple sources of evidence. For our experiments
(as opposed to the results which were submitted to
TREC-2, which were just the comb1 and combx results as
described above), we divided the query formulations for both
ad hoc and routing topics, into five different groups. In
each group, each topic was represented by one query, and no
searcher was represented more than once in any one group.
This distribution was meant to control for possible searcher
effects. We then did runs for each single group, and for
each combination of groups, for both ad hoc and routing
topics. With these data, we were able to compare retrieval
performance of different levels of query combination, and to
compare retrieval performance of combined queries with
uncombined.
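
The set of runs implied by this design can be enumerated directly. In the sketch below the group labels are hypothetical placeholders; the point is only to show how many runs each level of combination involves when every combination of the five groups is used.

```python
from itertools import combinations

# Hypothetical labels for the five groups of query formulations.
groups = ["g1", "g2", "g3", "g4", "g5"]

# One run per subset of groups, organised by the number of groups combined.
runs_by_level = {k: list(combinations(groups, k)) for k in range(1, 6)}
counts = {k: len(runs) for k, runs in runs_by_level.items()}

print(counts)                 # number of runs at each combination level
print(sum(counts.values()))   # total number of runs per topic type
```

Averaging performance over all runs at a given level, rather than reporting a single arbitrary combination, is one way to separate the effect of combination level from the effect of which searchers happened to be combined.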

2.3   Data  Fusion  Experiments 

Data fusion was accomplished by a list-merging
method which is the natural extension of a 3-out-of-5 data
fusion logic in the binary case. The basic data used was the
five lists of documents retrieved by the five different query
formulations for each topic. Every document has some
rank in each of the five lists being joined together. An effective
rank is calculated by taking the third highest of the
five ranks which the document has. This has the same effect
as moving a threshold along the list of effective ranks,
and including a document in the output when it has appeared
on three of the lists. Since there are five scores all
together, this can also be thought of as a median rule.

In practice, to maintain consistency with other parts of
our work, we did not calculate the rank of every document,
but worked with the lists of the top 1000 documents produced
in response to each query formulation. This meant
that some documents would appear on all five of the lists,
others on just four, or three, or even fewer. Of course, the
whole logic of data fusion suggests that those which appear
on more lists are more likely to be relevant. We implemented
this, in fact, by forming a combined sort key consisting
of (10 - degeneracy, 3rd rank). The degeneracy is the
number of lists on which a specific document appears in the
top 1000. We used a lexicographic sort, so that all items
with degeneracy 5 appeared before any items with degeneracy
4, and so on. Within a given degeneracy, items with
lower values for the 3rd rank were ranked first.
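
The merging rule can be sketched as follows. This is a minimal illustration rather than the code used for our runs. Two details are assumptions of the sketch: a document missing from a list is given the placeholder rank cutoff + 1, so that documents on fewer than three lists share the same 3rd rank, and remaining ties are broken by document identifier. The function name and the toy lists are hypothetical, and each list is assumed to contain no duplicate documents.

```python
def fuse_ranked_lists(ranked_lists, cutoff=1000, vote=3):
    """Merge ranked lists using the sort key (10 - degeneracy, 3rd rank)."""
    absent = cutoff + 1          # placeholder rank for "not in this list"
    n_lists = len(ranked_lists)
    ranks = {}
    for list_index, docs in enumerate(ranked_lists):
        for position, doc in enumerate(docs[:cutoff], start=1):
            ranks.setdefault(doc, [absent] * n_lists)[list_index] = position

    def sort_key(doc):
        doc_ranks = ranks[doc]
        # Degeneracy: number of lists on which the document appears.
        degeneracy = sum(1 for r in doc_ranks if r != absent)
        # Effective rank: the third best of the document's ranks.
        effective_rank = sorted(doc_ranks)[vote - 1]
        return (10 - degeneracy, effective_rank, doc)

    return sorted(ranks, key=sort_key)


# Toy example with five short lists (best document first in each list).
example_lists = [
    ["a", "b", "c"],
    ["b", "a", "d"],
    ["a", "c", "b"],
    ["c", "a", "e"],
    ["a", "b"],
]
fused = fuse_ranked_lists(example_lists)
print(fused)
```

Because degeneracy is the first component of the key, a document found by all five formulations precedes any document found by only four, whatever their individual ranks.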

3.  Results 

3.1  Caveats 

The  results  presented  in  this  paper  differ  in  several  ways 
from  those  submitted  as  "official"  results  to  TREC-2, 
which  are  published  at  the  end  of  this  volume.  According 
to our experimental design, there are five independently produced
query formulations for each of the TREC topics.
However,  due  to  uneven  return  rate  among  our  searchers,  we 
were  missing  one  searcher's  set  of  queries  for  the  ad  hoc 
topics,  and  three  searchers'  sets  of  queries  for  the  routing 
topics,  when  we  did  the  "official"  runs.  Consequently,  in 
the  official  results,  five  ad  hoc  topics  and  fifteen  routing 
topics  are  represented  by  four  searches,  rather  than  five. 
However,  we  were  subsequently  able  to  obtain  substitute 
searchers,  and  so  for  the  "unofficial"  results  presented  and 
discussed  in  this  paper,  we  have  the  full  complement  of  75 
searchers  and  five  query  formulations  per  topic. 

We were unable to report the data fusion results for
routing topics for the official results, because of time constraints.
We have subsequently been able to do those runs,
and report them here as unofficial results.

We  also  caution  that  one  query  for  one  of  our  ad  hoc 
topics  is  known  to  have  a  syntactic  error  which  resulted  in 
very poor performance for that single query, and for all unweighted
combinations of queries in which it was present.
Therefore,  some  of  our  comparative  results  in  the  ad  hoc 
case  may  be  slightly  incorrect. 

3.2  General  Results 

Because our analyses of ad hoc topics are based on a
subset of the total sample, we here consider questions of the
sample representativeness. As explained above, the sample
was originally chosen to represent topic domains. To see if
this had introduced some other bias, we compared the distribution
of our 25 topics along the three dimensions of topics
proposed by Harman (this volume). These are: broadness,
operationally defined as the total number of relevant
documents found for that topic; hardness, operationally defined
as inverse to the median average precision for that
topic; and, restriction, defined according to linguistic characteristics
of the topic. The distribution of the 25 topics in
our sample did not differ significantly from the total ad hoc
topic distribution on any of these dimensions, so we feel
reasonably confident that we did not select a markedly biased
subset of topics.

Tables  1  and  2  present  a  descriptive  profile  of  the 
queries and topics in our study, based upon the query
formulations, and the searchers' responses to our questionnaire.
Table  1  shows  the  distribution  of  numbers  of  words  and  op- 
erators per  query,  and  also  of  time  required  to  construct  a 
query.  Table  2  shows  the  distribution  of  searchers'  attitudes 
to  the  topics,  each  indicated  on  a  scale  of  one  to  five,  from 
least  to  most. 


                  Mean    Std Dev.   Min.     Max.      N

Operators         9.94      5.72     1.00    44.00     375
Words            19.40     14.63     1.0    145.00     375
Time (minutes)   11.31      7.48     1.00    40.00     367

Table 1. Characteristics of queries for ad hoc and routing
topics.


                        Mean    Std Dev.   Min.    Max.     N

Familiarity             1.81      1.15     1.00    5.00    388
Ease of construction    2.82      1.11     1.00    5.00    372
Enough information      3.20      1.11     1.00    5.00    322

Table 2. Characterization of topics by searchers, for
routing and ad hoc topics.

Our  ad  hoc  questionnaire  also  included  a  question  on 
how  many  years  of  experience  each  searcher  had  in  online 
searching.  The  mean  response  was  6.8  years.  Unfortu- 
nately, we  do  not  have  these  data  for  the  routing  searchers. 

We  wished  to  consider  whether  there  were  any  relation- 
ships between  the  various  characteristics  of  queries  and  top- 
ics and  the  performance  of  the  queries  themselves.  For  this 
purpose,  we  constructed  a  table  in  which  each  separate  query 
formulation  (75x5=375)  is  associated  with  performance 
measures,  the  characteristics  enumerated  in  tables  1  and  2, 
and  the  three  topic  categories  of  broadness,  hardness  and  re- 
striction defined  by  Harman  (this  volume).  For  perfor- 
mance, we  considered  using  one  or  more  of  three  measures: 
average  of  11-point  precision;  precision  at  100  documents; 
and  R-precision  (defined  by  Harman,  this  volume).  Factor 
analysis  of  these  three  measures  showed  that  a  single  factor 
accounts  for  more  than  90%  of  the  variance  among  them, 
so  that  they  represent,  in  effect,  a  single  aspect  or  factor  of 
performance.  The  average  precision  was  chosen  as  represen- 
tative of  this  factor,  and  we  have  used  it  both  in  evaluation 
of  our  retrieval  results,  and  in  attempting  to  determine  the 
effect  of  the  other  variables  we  have  considered,  on  retrieval 
performance.  Since  this  variate  does  not  exhibit  a  normal 
distribution,  logarithmic  and  logistic  transforms  were  ex- 
plored. The  logistic  leads  to  a  most  nearly  normal  distribu- 
tion of  the  transformed  score,  but  we  can  still  not  say  that 
the  transformed  variable  follows  a  normal  distribution. 
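
The logistic transform mentioned above is the log-odds form log(p/(1-p)) that also heads Table 3. A minimal sketch follows; the clipping constant `eps` and the sample values are illustrative assumptions, not taken from our data.

```python
import math

def logit(p, eps=1e-6):
    """Logistic (log-odds) transform log(p / (1 - p)) of a precision value.

    eps is a hypothetical clipping constant (not from the paper); it keeps
    the result finite when p is exactly 0 or 1.
    """
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))

# Made-up average precision values, for illustration only.
sample = [0.05, 0.14, 0.20, 0.50]
transformed = [logit(p) for p in sample]
```

The transform maps the bounded interval (0, 1) onto the whole real line, which is why it can bring a skewed precision distribution closer to normal than the raw scores.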

The results of applying ANOVA to seek a predictor of
the average precision p are shown in Table 3. No significant
relations appear. Because of the range of values assumed by
the variables Operators, Words and Time, the relation was
sought using regression analysis. Once again, no significant
relations were found, and the scatter plots (not included
here) make it clear that there is no trend to be found. Both
hardness and broadness are significantly related to
performance. The former is expected, since the hardness is
determined by median average precision; the latter is less
obvious.


Analysis of variance for log(p/(1-p))

Independent variable    Significance

Familiarity             0.149
Easiness                0.169
Information             0.907


Table  3.  Significance  levels  of  F-tests  using  ANOVA  to 
seek  dependence  of  the  logistically  transformed  average  pre- 
cision on  the  searcher's  assessments  of  their  query  formula- 
tion. 

The  search  for  relations  between  average  precision  and 
characteristics  of  the  query  formulation,  whether  provided 
by the searcher, or determined from the query text itself, was
motivated  by  the  results,  discussed  below,  which  show  that 
it  is  desirable  to  weight  formulations  in  proportion  to  their 
average  precision.  Thus,  if  we  could  find  a  surrogate  for 
average  precision  which  can  be  known  without  evaluating 
the  retrieved  documents,  it  would  be  possible  to  approxi- 
mate the  effective  combination  on  the  first  pass  of  a  re- 
trieval operation.  This  hope  is  frustrated  at  this  time. 

3.3  Query  Combination  and  Data  Fusion 
Results:    Ad  hoc  Topics 

The  official  results  reported  to  TREC-2  were  for  the 
overall  performance  of  each  of  two  treatments  for  the  ad  hoc 
topics,  and  of  one  treatment  for  the  routing  topics.  For 
those  results,  we  refer  the  reader  to  the  relevant  section  of 
this  volume.  Here  we  report  on  our  further  investigations 
on  the  effect  of  combination  of  queries,  and  of  data  fusion, 
on  performance. 

Our first investigation in query combination was to see
if combining query formulations has a regular, beneficial
effect, as hypothesized. To do this, we generated the five
different search groups for the ad hoc topics, as described in
section 2.2, and did experimental runs on all single query
groups, all 2-way combinations of queries, all 3-way
combinations of queries, all 4-way combinations of queries, and
the combination of all 5 query formulations. The results
are presented in Table 4, where it is evident that the average
performance increases monotonically as more evidence is
added. The increase is strict and significant, as shown in
Table 4a, where we display the number of times that each
combination level performed better than each other level.
We note that the data fusion results are not significantly
better than any but 1-way combination (that is, average
performance for single queries), but also that its performance is
not significantly different from unweighted 5-way
combination.

          1-way    2-way    3-way    4-way    5-way    fusion

          0.1441   0.2016   0.2235   0.2361   0.2349   0.2042
          0.1571   0.1823   0.2304   0.2225
          0.1121   0.2051   0.2102   0.2292
          0.1589   0.1951   0.2043   0.2200
          0.1378   0.1763   0.2079   0.2166
                   0.2113   0.2171
                   0.1727   0.2172
                   0.1683   0.1873
                   0.1633   0.2116
                   0.1885   0.1934

Mean      0.1420   0.1864   0.2103            0.2349   0.2042

For example, column two represents the ten possible ways of
choosing two groups of query formulations from the collection
of five groups. Each entry is an average over 25 topics.

Table 4. For ad hoc topics, average 11-point precision,
by group, for each combination of queries, and mean
average precision for all groups at each level of combination.

          1-way    2-way    3-way    4-way    5-way    fusion

1-way              1**      3**      3**      3**      5**
2-way     24**              5**      6**      5**      9
3-way     22**     20**              6**      3.5**    11
4-way     22**     19**     19**              3**      12
5-way     22**     20**     21.5**   22**              15
fusion    20**     16       14       13       10

** = significant difference at p < .01, sign test
*  = significant difference at p < .05, sign test
Read row with respect to column, e.g. 2-way performed better
than 1-way 24 out of 25 times, or 1-way performed better than
2-way 1 out of 25 times

Table 4a. Number of times, for average performance of
combinations for ad hoc topics, that one treatment
performed better than another.

The results presented in Tables 4 and 4a are based on
the average performance for the query formulations in any
one set. In Tables 5 and 5a, we present data on
performance, for ad hoc topics, when only the best query
formulation, or best combination of query formulations, for each
topic is used. These results are compared with the single
5-way combination (which is the only combination possible
at this level with our data), and with the fusion results. It
is of some interest to note that the ranking of level of
combination is now very much different from that for average
performance, with 2-way and 3-way combination being
significantly better than 1-way, 4-way, 5-way and fusion (see
Table 5a).

          1-way    2-way    3-way    4-way    5-way    fusion

          0.2712   0.3002   0.2959   0.2702   0.2350   0.2042

Table 5. For ad hoc topics, mean 11-point precision for
best-performing combination of queries for each topic.

          1-way    2-way    3-way    4-way    5-way    fusion

1-way              6**      7*       13       17       21**
2-way     19**              16.5     20**     22**     24**
3-way     18*      8.5               20.5**   22**     22**
4-way     12       5**      4.5**             22.5**   20**
5-way     8        3**      3**      2.5**             15
fusion    4**      1**      3**      5**      10

** = significant difference at p < .01, sign test
*  = significant difference at p < .05, sign test
Read row with respect to column, e.g. 2-way performed better
than 1-way 19 out of 25 times, or 1-way performed better than
2-way 6 out of 25 times

Table 5a. Number of times, for performance of best
combinations for ad hoc topics, that one treatment performed
better than another.

3.4  Adaptive  Combination:    Ad  hoc  Topics


Finally, to get an overall idea of how query
combination in the ad hoc case worked, and to estimate whether
taking account of the evidence of search performance could
improve subsequent performance, we compared performance of
simple combination of all five query formulations (comb1)
with performance when only the best single query
formulation for each topic was used (best), and with combination of
all five query formulations weighted according to the
precision at 100 documents retrieved of each formulation (comby).
The results, reported in Tables 6 and 6a, show that there is
no significant difference between comb1 and best, but that
comby is significantly better than comb1. While formation
of comby would not be possible under the conditions of the
ad hoc TREC task, these results are of interest because they
simulate the kind of operations that could be implemented
in a fully interactive interface to an IR system.
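
The difference between the unweighted and the weighted treatments can be sketched as a weighted mean of per-formulation document scores. This is only an illustration under stated assumptions: the function name `combine_scores`, the rule that a document missing from a formulation's output contributes zero, and all scores and weights below are hypothetical, and the sketch does not reproduce the internal operators of the INQUERY system.

```python
def combine_scores(score_lists, weights=None):
    """Rank documents by the weighted mean of their per-formulation scores.

    score_lists: one dict (doc id -> score) per query formulation.
    weights: one weight per formulation, e.g. its precision at 100
             documents; None means equal weights (the unweighted case).
    Assumption: a document not retrieved by a formulation scores 0 there.
    """
    if weights is None:
        weights = [1.0] * len(score_lists)
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    combined = {}
    for scores, w in zip(score_lists, weights):
        for doc, s in scores.items():
            combined[doc] = combined.get(doc, 0.0) + (w / total) * s
    # Highest combined score first; ties broken by document id.
    return sorted(combined, key=lambda d: (-combined[d], d))

# Made-up scores from three formulations of one topic.
runs = [{"d1": 0.9, "d2": 0.4},
        {"d2": 0.8, "d3": 0.7},
        {"d3": 0.9, "d1": 0.1}]
unweighted = combine_scores(runs)
weighted = combine_scores(runs, weights=[0.60, 0.10, 0.05])
```

With these made-up numbers the weighted ranking promotes the document favoured by the formulation with the highest precision, which is the effect the weighted treatment is meant to exploit.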


          comb1    best     comby    fusion

          0.2350   0.2712   0.2819   0.2042

comb1 = unweighted combination of all queries for each topic
best  = best performing query for each topic
comby = weighted (by prec.@100 docs) combination of all
        queries for each topic

Table 6. For ad hoc topics, mean 11-point precision for
four treatments.

In reading Tables 6 and 6a, note that the choice referred
to as "best" corresponds exactly to the choice called "1-way"
in Table 5. However, it does not correspond to any of the
entries in the first column of Table 4. The entries in Table
4 refer to combinations based upon the fixed groups. But
the combination of groups which performs best for, say,
Topic 57, need not be the one which performs best for
Topic 72. In Tables 6 and 6a, the best possible
combination is chosen for each topic individually. Note also that,
in the INQUERY system, the "unweighted sum"
corresponds to a symmetrical assignment of equal weight to all
formulations.


          comb1    best     comby    fusion

comb1              8        4**      15
best      17                9        21**
comby     21**     16                20**
fusion    10       4**      5**

** = significant difference at p < .01, sign test
*  = significant difference at p < .05, sign test
Read row with respect to column, e.g. comby performed better
than comb1 21 times, or comb1 performed better than comby
4 times.

Table 6a. Number of times that one treatment for ad hoc
topics performed better than another.

3.5  Query  Combination  and  Data  Fusion 
Results:    Routing  Topics 

We ran further experiments on the routing queries,
analogous to those we used for the ad hoc queries. Our first
set of results shows the progressive effect of unweighted
combination of query formulations, by level of
combination, when average performance at each level is considered
(Tables 7 and 7a). Again, as for the ad hoc queries (Tables 4
and 4a), there is a progressive, significant effect of level of
query combination. For the routing queries, data fusion
appears to have a somewhat stronger effect than for ad hoc,
being significantly better than 1-, 2- and 3-way
combination. It is of some interest to note that the overall level of
performance for routing topics is much higher than for the
ad hoc topics.
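
The significance markers in the "number of times better" tables come from a sign test on per-topic wins. The text does not say which variant was used, so the sketch below is an assumption: a two-sided test using the normal approximation to the binomial, without continuity correction. The function name `sign_test_p` is hypothetical. Fractional win counts, such as 36.5, can be passed directly.

```python
import math

def sign_test_p(wins, n):
    """Two-sided sign test p-value via the normal approximation.

    Under the null hypothesis each topic is a fair coin flip, so the
    number of wins has mean n/2 and standard deviation sqrt(n)/2.
    Assumption: no continuity correction is applied.
    """
    z = (wins - n / 2.0) / (math.sqrt(n) / 2.0)
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)), the two-sided tail.
    return math.erfc(abs(z) / math.sqrt(2.0))

p_47_of_50 = sign_test_p(47, 50)   # a lopsided comparison
p_28_of_50 = sign_test_p(28, 50)   # a nearly even comparison
```

A count of wins close to half the number of topics gives a large p-value, while a lopsided count gives a small one; the thresholds .05 and .01 then determine the single and double asterisks.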


          1-way    2-way    3-way    4-way    5-way    fusion

          0.1763   0.2311   0.2599   0.2619   0.2807
          0.1890   0.2202   0.2503   0.2748
          0.1684   0.2258   0.2603   0.2735
          0.2025   0.2229   0.2314   0.2512
          0.1793   0.2436   0.2415   0.2745
                   0.2364   0.2471
                   0.2388   0.2509
                   0.2160   0.2654
                   0.2149   0.2642
                   0.2338   0.2417

Mean      0.1831   0.2283   0.2513   0.2672   0.2807   0.2661

Each entry is an average over 50 topics.

Table 7. For routing topics, average 11-point precision,
by group, for each combination of queries, and mean
average precision for all groups at each level of combination.


          1-way    2-way    3-way    4-way    5-way    fusion

1-way              3**      2**      1**      2**      8**
2-way     47**              5.5**    6**      5.5**    13**
3-way     48**     44.5**            9**      7**      18*
4-way     49**     44**     41**              8**      22.5
5-way     48**     44.5**   43**     42**              28
fusion    42**     37**     32*      27.5     22

** = significant difference at p < .01, sign test
*  = significant difference at p < .05, sign test
Read row with respect to column, e.g. 2-way performed better
than 1-way 47 out of 50 times, or 1-way performed better than
2-way 3 out of 50 times

Table 7a. Number of times, for average performance of
combinations for routing topics, that one treatment
performed better than another.


As  for  the  ad  hoc  topics,  we  then  compared  the  results 
of  the  best  query  formulation  combinations  for  each  level  of 
combination,  with  the  unweighted  5-way  combination,  and 
fusion  results.  As  for  the  ad  hoc  queries,  this  gave  us  quite 
a  different  ranking  of  levels  of  combination,  with  3-way  and 
2-way  combinations  being  significantly  better  than  all  oth- 
ers, and  4-way  being  significantly  better  than  5-way  and  fu- 
sion (tables  8  and  8a). 


          1-way    2-way    3-way    4-way    5-way    fusion

          0.2931   0.3173   0.3199   0.3069   0.2807   0.2661

Table 8. For routing topics, mean 11-point precision for
best-performing combination of queries for each topic.


          1-way    2-way    3-way    4-way    5-way    fusion

1-way              8.5**    13.5**   22       29       36**
2-way     41.5**            20.5     34*      38**     39**
3-way     36.5**   29.5              37**     42**     45**
4-way     28       16*      13**              44**     40**
5-way     21       12**     8**      6**               28
fusion    14**     11**     5**      10**     22

** = significant difference at p < .01, sign test
*  = significant difference at p < .05, sign test
Read row with respect to column, e.g. 2-way performed better
than 1-way 41.5 times, or 1-way performed better than 2-way
8.5 times

Table 8a. Number of times, for performance of best
combinations for routing topics, that one treatment performed
better than another.


3.6  Adaptive  Combination:  Routing 
Topics 

Finally, we wished to investigate the effectiveness of
progressively taking account of retrieval performance in
modification of the query formulation. To do this, we
compared performance of unweighted 5-way query
combination (comb1) with performance using the best-performing
query formulations in the training database (best1), the
best-performing query formulations in the test database (best2),
the weighted 5-way query combination using weights from
the training database (combx), the weighted 5-way query
combination using weights from the test database (comby),
and 5-way query combination weighted by the mean of the
weights for test and training databases (combxy). The
weights that we used were the precision at 100 retrieved
documents for each query formulation. In the official
results, we used average 11-point precision. The reason for
the change is that precision at some cutoff level is a
realistic measure for the routing task in general, and
especially in an operational environment, whereas the average
precision is a measure that we cannot realistically expect to
have in an operational environment. When we compared the
performance of both weights in the combx formulation, there
was no significant difference. The results are presented in
Tables 9 and 9a, and show that taking account of subsequent
evidence has a positive and significant effect on
performance. When reading Tables 9 and 9a, note that the
entries for comb1 and fusion have already appeared in Table
7, as "5-way" and "fusion", respectively. Also, "best2" has
already appeared in Table 8, as the best "1-way" combination.
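
The weights described above can be sketched as follows. This is a minimal illustration, assuming that precision at a cutoff divides by the cutoff itself and that the combined weight is the plain arithmetic mean of the training and test weights; the function names and sample values are hypothetical.

```python
def precision_at_k(ranking, relevant, k=100):
    """Fraction of the top k retrieved documents that are relevant.

    ranking: document ids, best first; relevant: set of relevant ids.
    Assumption: the denominator is k even if fewer than k were retrieved.
    """
    hits = sum(1 for doc in ranking[:k] if doc in relevant)
    return hits / k

def mean_weights(train_weights, test_weights):
    """Average, per query formulation, the training and test weights."""
    return [(a + b) / 2.0 for a, b in zip(train_weights, test_weights)]

# Made-up example: one formulation's ranking, judged on four documents.
weight = precision_at_k(["a", "b", "c", "d"], {"a", "c"}, k=4)
combined = mean_weights([0.25, 0.50], [0.75, 0.50])
```

Unlike average precision, a weight of this kind needs relevance judgments only for the documents above the cutoff, which is what makes it plausible in an operational routing setting.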


comb1    best1    best2    combx    comby    combxy   fusion

.2807    .2721    .2931    .3012    .3090    .3068    .2661

comb1  = unweighted combination of all queries for each topic
best1  = best performing query (on training set) for each topic
best2  = best performing query (on test set) for each topic
combx  = weighted (by prec.@100 docs in training set)
         combination of all queries for each topic
comby  = weighted (by prec.@100 docs in test set) combination
         of all queries for each topic
combxy = weighted (by mean of the sum of prec.@100 docs in
         training and test sets) combination of all queries
         for each topic

Table 9. For routing topics, mean 11-point precision for
seven treatments.


Table 9a encapsulates all of the key concepts of the
several approaches to combination that we have explored.
We have two approaches which are a priori and symmetric
in their treatment of the query formulations (fus and
comb1). As expected, the fusion system, using the least
information, performs worse; comb1, the symmetric
formulation, does better, although the difference is not
statistically significant. Both of these methods often perform
better than the best of the individual formulations, and their
relations to other combination schemes are (except for the
relation to best2) quite similar. The query that performs
best on the training set (best1) does not perform
significantly better than any of the combination schemes. But
that formulation which performs best on the test set (best2,
also called 1-way in Table 8) is significantly better than
best1 and the fusion scheme.


          comb1    best1    best2    combx    comby    combxy   fus

comb1              29       21       13.5**   16*      14.5**   28
best1     21                13**     14**     14**     12.5**   22.5
best2     29       37**              23       20       23       36**
combx     36.5**   36**     27                21.5     18**     40**
comby     34*      36**     30       28.5              25.5     36.5**
combxy    35.5**   37.5**   27       32*      24.5              37**
fus       22       27.5     14**     10**     13.5**   13**

** = significant difference at p < .01, sign test
*  = significant difference at p < .05, sign test
Read row with respect to column, e.g. combx performed better
than comb1 36.5 times, or comb1 better than combx 13.5
times.

Table 9a. Number of times that one treatment for routing
topics performed better than another.


Of greater interest are the methods representing adaptive
weighting schemes: combx, comby and combxy. Most
significantly, combx, the adaptive weighting formulation,
is better than the symmetrically weighted combination
(comb1), the fusion rule, and the best single formulation in
a substantial fraction (over 70%) of all cases. The
weighting based on the test set (comby) stands also in essentially
the same relation to those three other schemes. Finally, the
weighting scheme combxy simulates a situation which
might arise in updating or tuning a combination rule after
two batches of documents have been retrieved. This is
accomplished by averaging the weights assigned to each
formulation in the training run with those assigned based on
the test run. This scheme shows essentially the same
profile as combx and comby when compared with the comb1,
fusion, best1 and best2 schemes. It performs significantly
better than combx, but not significantly better than comby.

4.  Discussion 

4.1    General  Results 

As  is  customary,  we  begin  this  section  with  a  general 
disclaimer.  In  this  case,  we  need  to  point  out  that  all  of  our 
results  were  obtained  with  a  very  specific  kind  of  query- 
formulation  technique  and  very  special  kinds  of  queries,  and, 
that  all  of  our  results  were  obtained  within  a  very  special  re- 
trieval context,  the  INQUERY  system.  It  is  certainly  pos- 
sible that  these  circumstances  strongly  affected  our  results, 
so  that  we  cannot  make  widely  general  claims  for  them. 
On  the  other  hand,  the  results  reported  by  Fox  and  Shaw 
(this  volume),  using  queries  generated  in  quite  different 
ways,  and  using  a  quite  different  IR  system  and  retrieval 
technique, are quite similar in general form and trend to
ours, although their specific figures are different. So we are
willing to believe that the influence of our experimental
situation is probably not enough to invalidate our results at
some level of generality.

There  are  several  aspects  of  our  general  results  which 
are  of  some  interest,  apart  from  the  issues  of  query  combi- 
nation and  data  fusion.  One  has  to  do  with  the  lack  of  any 
significant  relationship  between  number  of  words  in  a 
query,  and  the  performance  of  a  query.  It  has  been  at  least 
informally  suggested  in  the  IR  community,  that  the  re- 
trieval performance  of  queries  increases  with  the  number  of 
words  in  the  query.  There  is  no  support  in  our  data  for  this 
hypothesis.  Indeed,  in  our  data,  there  is  at  least  one  one- 
word  query,  which  performed  better  than  all  of  the  other, 
multi-word  queries  for  that  topic. 

It  is  also  of  interest  that  none  of  the  query/searcher 
characteristics  was  related  to  performance.  This  may  be  a 
characteristic  of  our  particular  data  set,  but  it  also  suggests 
that  it  will  be  rather  difficult  to  identify  characteristics  of 
people  or  topics  at  this  level,  which  will  be  predictive  of 
performance  of  the  query. 

Although  the  level  of  familiarity  by  the  searchers  on 
the  topics  was  in  general  rather  low,  our  searchers  neverthe- 
less found  it  not  too  difficult  to  formulate  queries  (mean  of 
2.82  on  a  scale  of  1  to  5),  and  felt  that  they  had  sufficient 
information  to  construct  a  reasonable  query,  on  the  basis  of 
the topic (mean of 3.2 on a scale of 1 to 5). This makes us
think  that  the  queries  are  likely  to  be  reasonable  formula- 
tions of  the  search  topics,  at  least  as  far  as  the  searchers  are 
concerned.  But  the  range  and  variability  of  the  numbers  of 
words  and  numbers  of  operators  per  topic  seems  to  indicate 
that  the  query  formulations  themselves  are  rather  different 
(we  have  not  yet  compared  them  for  overlap  in  specific 
words,  but  work  on  this  issue  is  in  progress).  These  two 
results  seem  to  us  to  confirm  our  initial  idea  that  each  query 
formulation  is  indeed  a  "different"  interpretation  of  the  in- 
formation problem,  and  thus  to  substantiate  our  general  ap- 
proach. 

4.2    Query  Combination  Results 

Our  results,  for  both  ad  hoc  and  routing  topics,  seem 
clearly  to  show  that,  in  general,  the  more  evidence  one  has, 
and  uses,  in  the  form  of  different  query  formulations,  the 
better  the  IR  performance  is  going  to  be.  In  particular,  ta- 
bles 4,  6,  7  and  9  support  this  conclusion,  in  various  re- 
spects. From  the  results  of  tables  6  and  9,  we  can  see  that, 
taking  advantage  of  what  one  learns  about  query  perfor- 
mance from  one  iteration  doesn't  help  a  lot,  after  the  first 
iteration,  but  on  the  other  hand,  it  doesn't  hurt,  either.  This 
suggests  to  us  that  continual  modification  and  reweighting 
of  the  multiple  query  formulations  in  a  combined  query,  is 
likely  to  be  useful  in  the  general  routing  environment.  But 
even  doing  it  once,  given  the  initial  evidence,  seems  to 
help.  This  also  suggests  that  continuing  to  add  new  query 
formulations  to  a  combined  query  will  likely  help  perfor- 
mance on  subsequent  runs. 

Having said all this, it is worth considering the results
of tables 5 and 8, which showed that picking the best 2-way
or 3-way combination of query formulations was
significantly better than using 4-way or 5-way combinations. On
the face of it, this runs counter to the general result of "the
more, the better". However, it is possible that this result is
an artifact of our data. For both 2-way and 3-way
combinations, it was possible to choose the best from ten different
combinations. Because we had only five different query
formulations for each topic, we had smaller pools from
which to choose, for both single query formulations, and
for the 4-way and 5-way combinations. This issue needs
further investigation.
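
The pool sizes behind this possible artifact follow directly from counting the ways of choosing k formulations out of five. A short sketch (the function name `pool_sizes` is ours, for illustration only):

```python
from math import comb

def pool_sizes(n=5):
    """Number of distinct k-way combinations of n query formulations."""
    return {k: comb(n, k) for k in range(1, n + 1)}

sizes = pool_sizes(5)
```

For five formulations the pools contain 5, 10, 10, 5 and 1 candidates for the 1-way through 5-way levels, so the "best" 2-way and 3-way combinations are each picked from twice as many candidates as the best 1-way or 4-way, and the 5-way level offers no choice at all.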

4.3    Data  Fusion  Results 

There  are  several  points  to  be  made  with  regard  to  the 
median-fusion  scheme  as  implemented  here.  First,  as  may 
be  expected  from  general  arguments,  it  sometimes  performs 
better  than  the  best  of  the  lists  which  are  joined  by  the  fu- 
sion process.  Second,  it  does  not  perform  as  well  as  even 
the  symmetric  (unweighted)  combinations  made  using  the 
internal  scores  generated  by  the  INQUERY  system.  This  is 
expected,  since  those  scores  contain  more  information  than 
the  rankings  alone.  One  can  imagine  special  cases  in  which 
the  distribution  of  scores  assigned  to  a  document  by  several 
queries  is  such  that  the  internal  combination  rule  of  un- 
weighted sum  does  not  perform  as  well,  but  this  has  appar- 
ently not  occurred  in  the  cases  studied  here. 

Third,  in  the  application  to  the  routing  problem  we 
have,  in  fact,  operated  in  a  batch  mode.  For  a  true  routing 
situation,  it  would  be  necessary  to  estimate  cutoff  scores  for 
the  several  query  formulations,  corresponding  to  the  cutoff 
rank  on  the  fused  list.  For  large  data  sets  this  can  be  done 
easily.  Without  this  step,  it  is  not  possible  to  make  an 
immediate  decision  about  a  newly  presented  document. 

Fourth, the stability induced by using this system was
manifested in the case of the one query for which we
discovered, too late to make the change, that one of the query
formulations was in error. For this case, all of the "average
combination of evidence formulations" performed more
poorly than the fusion rule. This is because one, or even
two, disastrously bad query formulations will have little
effect on the results of the 3-of-5 fusion rule. Of course,
except for the case of combining all five query formulations,
the best of one, two, three or four query formulations can
do well because the one bad formulation will be missing
from the combinations that are best.
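
The robustness argument can be illustrated with a small rank-based sketch. The fusion procedure itself is defined earlier in the paper and is not reproduced here; the version below is an assumption in which each document receives the median of its ranks over the five lists (the third best of five), and a document missing from a list is treated as having infinite rank there. The function name and the toy rankings are hypothetical.

```python
import math
from statistics import median

def median_rank_fusion(rankings):
    """Fuse ranked lists by ordering documents on their median rank.

    rankings: lists of document ids, best first, one per formulation.
    Assumption: a document absent from a list has rank infinity there.
    """
    docs = set().union(*rankings)
    positions = [{d: i + 1 for i, d in enumerate(r)} for r in rankings]

    def median_rank(doc):
        return median([pos.get(doc, math.inf) for pos in positions])

    # Lowest median rank first; ties broken by document id.
    return sorted(docs, key=lambda d: (median_rank(d), d))

good = ["a", "b", "c"]
bad = ["z", "y", "x"]          # a formulation that went badly wrong
one_bad = median_rank_fusion([good, good, good, good, bad])
two_bad = median_rank_fusion([good, good, good, bad, bad])
```

In this toy case the fused order still begins with the documents on which the sound formulations agree, whether one or two of the five lists are bad, because the median ignores up to two outlying ranks.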

Finally,  the  application  of  data  fusion  here,  at  the  so- 
called  decision  level  (that  is  to  say,  after  the  documents 
have  been  ranked  according  to  several  rules)  is  a  simulation 
for the case to which it should be applied. Since the
specific system that we used permits internal manipulation of
scores,  there  is  no  need  to  delay  combination  until  after  the 
output  lists  have  been  formed.  But  in  realistic  settings, 
several  distinct  systems  will  have  internal  operations  which 
are  not  compatible,  so  that,  even  if  it  were  possible  to  ex- 
tract the  internal  scores,  it  would  not  be  apparent  how  to 
combine them.

5.  Conclusions


ARPA. 


In  general,  we  conclude  that  our  initial  research  ques- 
tions with  respect  to  query  combination  have  been  posi- 
tively answered.  That  is,  if  one  has  available  several  differ- 
ent representations  of  a  single  information  problem,  then  it 
makes  sense  to  use  all  of  them,  in  combination,  in  order  to 
improve  retrieval  performance,  rather  than  to  try  to  identify 
and  use  only  the  best  one.  In  addition,  it  is  reasonably  clear 
that  progressive  and  continuous  combination  of  query  for- 
mulations leads  to  continuing  and  progressive  improvement 
of  performance.  This  may  extend  to  progressive  modifica- 
tion of  query  formulations  in  the  routing  situation,  for  in- 
stance, on  the  basis  of  each  iteration  of  retrieval.  Neverthe- 
less, some  of  our  results  appear  anomalous,  and  in  particu- 
lar we  need  to  address  more  carefully  the  issue  of  how  best 
to  combine  query  formulations. 

As  far  as  our  data  fusion  questions  are  concerned,  we 
have  clearly  demonstrated  that  doing  data  fusion  is  better 
than  using  only  one  query  formulation.  Although  perfor- 
mance improvement  in  these  experiments  was  rather  low, 
for  operational  settings  in  which  there  are  multiple  systems 
with  incompatible  scores,  a  data  fusion  method  that  works 
with the ranked outputs, rather than the scores, is the precise
method  that  is  needed.  In  the  present  study  we  have  shown 
how  that  method  can  be  extended  from  the  case  of  binary 
(set)  retrieval  to  the  case  of  ranked  lists.  We  have  shown 
that  the  results  are,  on  the  average,  better  than  the  results  of 
the  individual  formulations.  In  some  cases,  they  are  better 
than  the  best  of  the  component  formulations.  This  lends 
support  to  a  program  of  seeking  optimal  tunings  for  fusion 
of  any  number  of  given  systems,  to  achieve  results  better 
than  any  of  them  alone  could  provide. 

Overall,  we  find  strong  support  for  adaptive  weighting 
in  query  combination.  This  is  applicable  to  both  routing, 
as  shown  directly  here,  and  to  relevance  feedback,  which  we 
have  simulated  in  our  application  to  the  ad  hoc  topics.  We 
also  find  strong  support  for  enlarging  the  set  of  query  repre- 
sentations. This  success  raises  many  interesting  possibili- 
ties. For  example,  one  might  systematically  explore  the  k- 
way  combinations  to  see  how  they  compare  to  the  adaptive 
weighting  scheme.  Or,  one  might  apply  the  notion  of 
adaptive  weighting  to  the  best  of  the  k-way  combinations. 
The possibilities for combining these two concepts explode
(of course) combinatorially. We feel that the present
experiments  point  a  way  into  the  forest  of  possibilities. 
APPENDIX: SEARCHER RESPONSE SHEET


QUERY  FORMULATION 
TOPIC  NUMBER:   PACKET  NUMBER: 


Please  formulate  a  search  query  for  one  of  the  five  topics  you  have  been  sent,  in  the  space  below.  Don't 
forget  to  indicate  the  topic  number  in  the  space  above.  Please  read  the  entire  topic  description  before  you 
begin your query formulation. (Use the back of this sheet if there isn't enough room below.)


Please  answer  the  following  questions,  as  they  relate  to  this  specific  query  formulation. 

1.  About  how  many  minutes  did  it  take  (including  reading  the  topic  description)?   minutes 

2.  Is  this  topic  related  to  things  you  normally  search  on  (please  circle  one  number)? 

Not  at  all  Somewhat  Very  much 

3.  How  easy  was  it  to  formulate  this  query? 

Not  easy  Somewhat  Very  easy 

4.  Do  you  feel  you  had  enough  information  to  construct  an  effective  query? 

Too  little  Adequate  Plenty 

5.  About  how  many  years  have  you  been  doing  online  searching? :  years. 
Automatic Routing and Ad-hoc Retrieval
Using SMART: TREC 2

Chris Buckley,* James Allan, and Gerard Salton


Abstract 

The  Smart  information  retrieval  project  em- 
phasizes completely  automatic  approaches  to 
the  understanding  and  retrieval  of  large  quan- 
tities of  text.  We  continue  our  work  in  the 
TREC  2  environment,  performing  both  rout- 
ing and  ad-hoc  experiments.  The  ad-hoc 
work  extends  our  investigations  into  combin- 
ing global  similarities,  giving  an  overall  indica- 
tion of  how  a  document  matches  a  query,  with 
local  similarities  identifying  a  smaller  part  of 
the  document  which  matches  the  query.  The 
performance  of  the  ad-hoc  runs  is  good,  but  it 
is  clear  we  are  not  yet  taking  full  advantage  of 
the  available  local  information. 

Our  routing  experiments  use  conventional 
relevance  feedback  approaches  to  routing,  but 
with  a  much  greater  degree  of  query  expan- 
sion than  was  done  in  TREC  1.  The  length 
of  a  query  vector  is  increased  by  a  factor  of 
5  to  10  by  adding  terms  found  in  previously 
seen  relevant  documents.  This  approach  im- 
proves effectiveness  by  30-40%  over  the  origi- 
nal query. 

Introduction 

For  over  30  years,  the  Smart  project  at  Cor- 
nell University  has  been  interested  in  the  anal- 
ysis, search,  and  retrieval  of  heterogeneous 
text  databases,  where  the  vocabulary  is  al- 
lowed to  vary  widely,  and  the  subject  matter 
is  unrestricted.  Such  databases  may  include 
newspaper  articles,  newswire  dispatches,  text- 
books, dictionaries,  encyclopedias,  manuals, 
magazine  articles,  and  so  on.  The  usual  text 
analysis  and  text  indexing  approaches  that  are 
based  on  the  use  of  thesauruses  and  other  vo- 
cabulary control  devices  are  difficult  to  apply 

*  Department  of  Computer  Science,  Cornell  Univer- 
sity, Ithaca,  NY  14853-7501.  This  study  was  sup- 
ported in  part  by  the  National  Science  Foundation  un- 
der grant  IRI  89-15847. 


in  unrestricted  text  environments,  because  the 
word meanings are not stable in such circumstances
and the interpretation varies depending on context.
The applicability of more complex text analysis
systems that are based on
the  construction  of  knowledge  bases  covering 
the  detailed  structure  of  particular  subject  ar- 
eas, together  with  inference  rules  designed  to 
derive  relationships  between  the  relevant  con- 
cepts, is  even  more  questionable  in  such  cases. 
Complete  theories  of  knowledge  representation 
do  not  exist,  and  it  is  unclear  what  concepts, 
concept  relationships,  and  inference  rules  may 
be  needed  to  understand  particular  texts. [11] 
Accordingly,  a  text  analysis  and  retrieval 
component  must  necessarily  be  based  primar- 
ily on  a  study  of  the  available  texts  themselves. 
Fortunately  very  large  text  databases  are  now 
available  in  machine-readable  form,  and  a  sub- 
stantial amount  of  information  is  automati- 
cally derivable  about  the  occurrence  properties 
of  words  and  expressions  in  natural-language 
texts,  and  about  the  contexts  in  which  the 
words  are  used.  This  information  can  help  in 
determining  whether  a  query  and  a  text  are  se- 
mantically  homogeneous,  that  is,  whether  they 
cover  similar  subject  areas.  When  that  is  the 
case,  the  text  can  be  retrieved  in  response  to 
the  query. 

Automatic  Indexing 

In  the  Smart  system,  the  vector-processing 
model  of  retrieval  is  used  to  transform  both 
the  available  information  requests  as  well  as 
the  stored  documents  into  vectors  of  the  form: 

$$D_i = (w_{i1}, w_{i2}, \ldots, w_{it})$$

where $D_i$ represents a document (or query)
text and $w_{ik}$ is the weight of term $T_k$ in
document $D_i$. A weight of zero is used for terms
that are absent from a particular document,
and positive weights characterize terms actually
assigned. The assumption is that $t$ terms
in all are available for the representation of the
information.

In  choosing  a  term  weighting  system,  low 
weights  should  be  assigned  to  high-frequency 
terms  that  occur  in  many  documents  of  a  col- 
lection, and  high  weights  to  terms  that  are  im- 
portant in  particular  documents  but  unimpor- 
tant in  the  remainder  of  the  collection.  The 
weight  of  terms  that  occur  rarely  in  a  collec- 
tion is  relatively  unimportant,  because  such 
terms  contribute  little  to  the  needed  similar- 
ity computation  between  different  texts. 

A well-known term weighting system following
that prescription assigns weight $w_{ik}$ to
term $T_k$ in query $Q_i$ in proportion to the
frequency of occurrence of the term in $Q_i$, and
in inverse proportion to the number of documents
to which the term is assigned. [12, 10]
Such a weighting system is known as a tf x idf
(term frequency times inverse document frequency)
weighting system. In practice the
query lengths, and hence the number of non-zero
term weights assigned to a query, vary
widely. To allow a meaningful final retrieval
similarity, it is convenient to use a length
normalization factor as part of the term weighting
formula. A high-quality term weighting formula
for $w_{ik}$, the weight of term $T_k$ in query
$Q_i$, is

$$w_{ik} = \frac{(\log(f_{ik}) + 1.0) \cdot \log(N/n_k)}{\sqrt{\sum_{j=1}^{t} \left[(\log(f_{ij}) + 1.0) \cdot \log(N/n_j)\right]^2}} \qquad (1)$$

where $f_{ik}$ is the occurrence frequency of $T_k$ in
$Q_i$, $N$ is the collection size, and $n_k$ the number
of documents with term $T_k$ assigned. The
factor $\log(N/n_k)$ is an inverse collection frequency
("idf") factor which decreases as terms
are used widely in a collection, and the denominator
in expression (1) is used for weight normalization.
This particular form will be called
"ltc" weighting within this paper.

The  weights  assigned  to  terms  in  documents 
are  much  the  same.  In  practice,  for  both  effec- 
tiveness and  efficiency  reasons  the  idf  factor  in 
the  documents  is  dropped. [1] 
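
As an illustration only, the weighting of expression (1) can be sketched in a few lines of Python. This is not Smart's code: the function name `log_tf_weights`, the dictionary-based sparse vectors, and the use of natural logarithms are assumptions made for the sketch. Setting `use_idf=False` drops the idf factor, as described above for document vectors.

```python
import math

def log_tf_weights(term_freqs, doc_freqs=None, num_docs=None, use_idf=True):
    """Cosine-normalized (log f + 1) weights, optionally multiplied by idf.

    term_freqs: {term: f_ik}, occurrences of each term in one text.
    doc_freqs:  {term: n_k}, number of documents containing the term
                (required for every term when use_idf is True).
    num_docs:   N, the collection size.
    use_idf:    True gives the query weights of expression (1);
                False drops the idf factor, as is done for documents.
    """
    raw = {}
    for term, freq in term_freqs.items():
        if freq <= 0:
            continue  # absent terms keep an implicit weight of zero
        weight = math.log(freq) + 1.0
        if use_idf:
            weight *= math.log(num_docs / doc_freqs[term])
        raw[term] = weight
    # The denominator of expression (1): Euclidean length of the raw vector.
    length = math.sqrt(sum(w * w for w in raw.values()))
    if length == 0.0:
        return raw
    return {term: w / length for term, w in raw.items()}
```

A term that occurs in every document of the collection receives an idf of log(1) = 0 and therefore contributes nothing to a query vector, however often it occurs in the query.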

The  terms  Tk  included  in  a  given  vector  can 
in  principle  represent  any  entities  assigned  to 
a  document  for  content  identification.  In  the 
Smart  context,  such  terms  are  derived  by  a 
text  transformation  of  the  following  kind:[10] 

1.  recognize  individual  text  words 


2.  use  a  stop  list  to  eliminate  unwanted  func- 
tion words 

3.  perform  suffix  removal  to  generate  word 
stems 

4.  optionally  use  term  grouping  methods 
based  on  statistical  word  co-occurrence 
or  word  adjacency  computations  to 
form  term  phrases  (alternatively  syntac- 
tic analysis  computations  can  be  used) 

5.  assign  term  weights  to  all  remaining  word 
stems  and/or  phrase  stems  to  form  the 
term  vector  for  all  information  items. 

Once  term  vectors  are  available  for  all  informa- 
tion items,  all  subsequent  processing  is  based 
on  term  vector  manipulations. 
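
Steps 1 to 3 of the transformation above can be sketched as follows. The stop list and suffix rules here are toy stand-ins chosen for the example (Smart's actual stop list and stemmer are not reproduced), and the names `STOP_WORDS`, `SUFFIXES`, `strip_suffix`, and `index_text` are hypothetical.

```python
import re
from collections import Counter

# Toy stand-ins; a real system would use a much larger stop list
# and a proper stemming algorithm.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "are", "for"}
SUFFIXES = ("ing", "ed", "es", "s")

def strip_suffix(word):
    """Remove the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index_text(text):
    """Steps 1-3: recognize words, drop stop words, stem; count the stems."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    stems = [strip_suffix(w) for w in words if w not in STOP_WORDS]
    return Counter(stems)
```

The resulting stem frequencies are the raw input to the term weighting step (step 5); phrase formation (step 4) is optional and is handled separately.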

The  fact  that  the  indexing  of  both  doc- 
uments and  queries  is  completely  automatic 
means  that  the  results  obtained  are  reasonably 
collection  independent  and  should  be  valid 
across  a  wide  range  of  collections.  No  human 
expertise  in  the  subject  matter  is  required  for 
either  the  initial  collection  creation,  or  the  ac- 
tual query  formulation. 

Phrases 

The same phrase strategy (and phrases) used
in TREC 1 ([1]) is used for TREC 2. Any
pair of adjacent non-stopwords is regarded
as a potential phrase. The final list of phrases
is composed of those pairs of words occurring
in 25 or more documents of the initial
TREC 1 document set (D1, TREC 1 initial
collection). Phrase weighting is again a hybrid
scheme where phrases are weighted with
the same scheme as single terms, except that
normalization of the entire vector is done by
dividing by the length of the single term
sub-vector only. In this way, the similarity
contribution of the single terms is independent of
the quantity or quality of the phrases.
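
A minimal sketch of this phrase strategy, under stated assumptions: "adjacent" is taken to mean adjacent in the original token sequence with neither word on the stop list, documents arrive already tokenized, and the function names are hypothetical rather than Smart's own.

```python
import math
from collections import Counter

def candidate_phrases(tokens, stop_words):
    """Yield pairs of adjacent tokens in which neither token is a stop word."""
    for left, right in zip(tokens, tokens[1:]):
        if left not in stop_words and right not in stop_words:
            yield (left, right)

def select_phrases(tokenized_docs, stop_words, min_docs=25):
    """Keep pairs that occur in at least min_docs distinct documents."""
    doc_counts = Counter()
    for tokens in tokenized_docs:
        # set() so that each document counts a pair at most once
        doc_counts.update(set(candidate_phrases(tokens, stop_words)))
    return {pair for pair, count in doc_counts.items() if count >= min_docs}

def normalize_hybrid(single_weights, phrase_weights):
    """Divide all weights by the length of the single-term sub-vector only."""
    length = math.sqrt(sum(w * w for w in single_weights.values()))
    if length == 0.0:
        return dict(single_weights), dict(phrase_weights)
    singles = {t: w / length for t, w in single_weights.items()}
    phrases = {p: w / length for p, w in phrase_weights.items()}
    return singles, phrases
```

Because the divisor ignores the phrase weights, the normalized single-term weights are the same whether or not any phrases are present, which is the independence property described above.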

Text  Similarity  Computation 

When the text of document $D_i$ is represented
by a vector of the form $(d_{i1}, d_{i2}, \ldots, d_{it})$ and
query $Q_j$ by the vector $(q_{j1}, q_{j2}, \ldots, q_{jt})$, a
similarity ($S$) computation between the two
items can conveniently be obtained as the inner
product between corresponding weighted
term vectors as follows:

$$S(D_i, Q_j) = \sum_{k=1}^{t} (d_{ik} \cdot q_{jk}) \qquad (2)$$

Thus,  the  similarity  between  two  texts 
(whether  query  or  document)  depends  on  the 
weights  of  coinciding  terms  in  the  two  vectors. 
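
With sparse vectors held as dictionaries (an assumption of this sketch, not a description of Smart's inverted-file implementation), expression (2) reduces to a sum over the terms the two vectors share:

```python
def inner_product(doc_vector, query_vector):
    """Sum d_ik * q_jk over the terms present in both sparse vectors."""
    # Iterate over the shorter vector; the inner product is symmetric.
    if len(query_vector) > len(doc_vector):
        doc_vector, query_vector = query_vector, doc_vector
    return sum(w * doc_vector[t] for t, w in query_vector.items() if t in doc_vector)
```

Terms that appear in only one of the two vectors have an implicit weight of zero in the other and so contribute nothing.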

Information  retrieval  and  text  linking  sys- 
tems based  on  the  use  of  global  text  similar- 
ity measures  such  as  that  of  expression  (2) 
will  be  successful  when  the  common  terms  in 
the  two  vectors  are  in  fact  used  in  seman- 
tically  similar  ways.  In  many  cases  it  may 
happen  that  highly-weighted  terms  that  con- 
tribute substantially  to  the  text  similarity  are 
semantically  distinct.  For  example,  a  sound 
may  be  an  audible  phenomenon,  or  a  body  of 
water. 

TREC 1 ([1]) demonstrated that local contexts
could be used to disambiguate word
senses,  for  example  rejecting  documents  about 
"industrial  salts"  when  given  a  query  about 
the  "SALT  peace  treaty".  Overall,  however, 
the  improvement  in  effectiveness  due  to  local 
matching was minimal in TREC 1. One reason
for  this  is  the  richness  of  the  TREC  queries. 
Global  text  matching  is  almost  invariably  suf- 
ficient for  disambiguation.  Another  reason  is 
the  homogeneity  of  the  queries.  They  deal  pri- 
marily with  two  subjects:  finance,  and  science 
and  technology.  Within  a  single  subject  area 
vocabulary  is  more  standardized  and  ambigu- 
ity is  therefore  minimized. 

One other potential reason for the unexpectedly
slight improvement is that most of the
information from local matches is simply being
thrown away. Local matches are used as a filter
to reject documents that do not satisfy a
local criterion: the overall global similarity used
for ranking is changed only by the addition of
a constant indicating the local match criterion
was satisfied. The positive information that a
long  document  might  have  a  single  paragraph 
which  very  closely  matched  the  query  is  ig- 
nored. 

For  TREC  2,  we  look  at  combining  global 
and  local  similarities  into  a  single  final  simi- 
larity to  be  used  for  ranking  purposes. 

The  other  focus  of  our  TREC  2  work  is  tak- 
ing advantage  of  the  vast  quantity  of  relevance 
judgements  available  for  the  routing  experi- 
ments. In  TREC  1,  the  relevance  information 
was  fragmentary  and  even  occasionally  incor- 


rect. It  was  hard  to  use  this  information  in  a 
reasonable  fashion.  Happily,  the  results  of  the 
TREC  1  experiments  furnished  a  large  num- 
ber of  very  good  relevance  judgements  to  be 
used  for  TREC  2.  Conventional  vector-space 
feedback  methods  of  query  expansion  and  re- 
weighting  are  tuned  for  the  TREC  environ- 
ment in  the  routing  portion  of  TREC  2. 

System  Description 

The  Cornell  TREC  experiments  use  the 
SMART  Information  Retrieval  System,  Ver- 
sion 11,  and  are  run  on  a  dedicated  Sun  Sparc 
2  with  64  Mbytes  of  memory  and  5  Gbytes  of 
local  disk. 

SMART  Version  11  is  the  latest  in  a  long 
line  of  experimental  information  retrieval  sys- 
tems, dating  back  over  30  years,  developed  un- 
der the  guidance  of  G.  Salton.  Version  11  is 
a  reasonably  complete  re-write  of  earlier  ver- 
sions, and  was  designed  and  implemented  pri- 
marily by  C.  Buckley.  The  new  version  is  ap- 
proximately 44,000  lines  of  C  code  and  docu- 
mentation. 

SMART Version 11 offers a basic framework
for investigations into the vector space
and  related  models  of  information  retrieval. 
Documents  are  fully  automatically  indexed, 
with  each  document  representation  being  a 
weighted  vector  of  concepts,  the  weight  indi- 
cating the  importance  of  a  concept  to  that  par- 
ticular document  (as  described  above).  The 
document  representatives  are  stored  on  disk  as 
an  inverted  file.  Natural  language  queries  un- 
dergo the  same  indexing  process.  The  query 
representative  vector  is  then  compared  with 
the  indexed  document  representatives  to  ar- 
rive at  a  similarity  (equation  (2)),  and  the  doc- 
uments are  then  fully  ranked  by  similarity. 

Ad-hoc  Results 

Cornell submitted two runs in the ad-hoc category.
The first, crnlV2, is a very simple vector
comparison. The second, crnlL2, makes use
of simplified least squares analysis and a training
set to combine global similarity and part-wise
similarities in a meaningful ratio. Both
systems performed at or above the median in
almost all queries, as can be seen in Table 1.
The  crnlV2-b  run  is  the  same  as  the  official 
crnlV2  run,  but  with  an  error  in  the  experi- 
mental procedure  corrected  (discussed  below). 
Run         Best   ≥ median   < median
crnlV2        1       38         11
crnlL2        4       40          6
crnlV2-b      9       36          5

Table 1: Comparative Ad-hoc results


Global  Similarity 

The crnlV2 run demonstrates the quality of
results obtainable with simple methods. The
weighting for terms is chosen based upon results
from TREC 1. Query terms are weighted
by the formula in equation (1) ("ltc" in
Smart's vocabulary). Document terms are
weighted using a normalized logarithmic term
frequency ("lnc"):

$$d_{ik} = \frac{\log f_{ik} + 1.0}{\sqrt{\sum_{j=1}^{t} (\log f_{ij} + 1.0)^2}} \qquad (3)$$

where $d_{ik}$ is the weight of term $T_k$ in document
$D_i$, $f_{ik}$ is the occurrence frequency of term
$T_k$ in document $D_i$, and $t$ is the total number
of terms in the collection. The denominator
provides normalization of vector length. Note
the absence of the "idf" factor $\log(N/n_k)$.

Table  2  shows  the  results  of  that  weight- 
ing scheme  in  crnlV2-b.  Regrettably,  be- 
cause of  an  oversight  during  the  official  run 
(a  misnamed  inverted  file),  the  official  submit- 
ted run  crnlV2  did  not  use  the  weighting  ap- 
proach described  above  (and  recommended  in 
our  TREC  1  report).  Instead  crnlV2  acciden- 
tally used  the  "idf"  factor  in  both  the  query 
and the document. That mistake caused a 10%
loss in retrieval effectiveness (from a recall-precision
average of 0.3512 to 0.3163).

Global  and  Local 

Cornell's TREC 1 ad-hoc submission increased
the similarity measure of a query
and document if some sentence in the query
matched some sentence in the document sufficiently
well. [1] The result was that any
query/document pair which contained a sentence
match was retrieved before all that did
not  have  such  a  match.  For  TREC  2,  we  hoped 
to  find  a  less  restrictive  balance  between  the 
global  and  local  similarities.  At  the  same  time, 
we wished to investigate local similarities using
parts other than sentences, and to investigate
combining multiple local similarities.

Our  approach  is  similar  to  that  used  in  [4]. 
We  built  a  training  collection  using  the  50 
queries  from  Q2  and  the  74,520  documents 
from  the  Wall  Street  Journal  included  in  D2. 
For each of the 3.7 million query/document
pairs, we calculate the global similarity and
some set of local similarity values. The least
squares polynomials (LSP) approach developed
for [4] is used to find the "ideal" coefficients
for the global and local values in the
equation:

$$\mathit{sim} = \alpha \cdot \mathit{global} + \beta_1 \cdot \mathit{local}_1 + \beta_2 \cdot \mathit{local}_2 + \ldots$$

(The  LSP  functions  actually  yield  a  constant 
factor  which  we  ignore  since  it  does  not  affect 
ranking.) 
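
The coefficient fitting can be illustrated with an ordinary linear least-squares fit. This sketch is a stand-in, not the LSP functions of [4]: it assumes the regression target is the binary relevance judgement of each query/document pair, solves the normal equations directly, and uses hypothetical function names.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: features are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(n):
            if row != col:
                factor = aug[row][col] / aug[col][col]
                aug[row] = [a - factor * p for a, p in zip(aug[row], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def fit_coefficients(features, relevance):
    """Least-squares fit of relevance ~ c0 + alpha*global + beta_1*local_1 + ...

    features:  one [global, local_1, ...] row per query/document pair.
    relevance: the 0/1 judgement (assumed target) for each pair.
    Returns [c0, alpha, beta_1, ...]; c0 is the constant that ranking ignores.
    """
    rows = [[1.0] + list(f) for f in features]
    k = len(rows[0])
    # Normal equations: (X^T X) c = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, relevance)) for i in range(k)]
    return solve_linear_system(xtx, xty)
```

For millions of training pairs the sums in `xtx` and `xty` can be accumulated in a single pass over the data, so the cost of the final solve depends only on the number of local values being fitted.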

We  consider  local  values  from  the  following 
broad  classes: 

•  Comparing  sentences  of  the  query  against 
sentences  of  the  document.  In  general,  we 
use  a  simple  "tfxidf"  weight  without  nor- 
malization, though  we  experimented  with 
other  weights. 

•  Comparing  paragraphs  of  the  query 
against  paragraphs  of  the  document.  (For 
the  most  part,  each  section  of  the  query 
topic  is  a  separate  paragraph.)  In  this 
case,  we  use  the  weighting  of  equation  1 
above  for  the  query  paragraphs,  and  try  a 
variety  of  weights  for  the  document  para- 
graphs. 

•  Comparing  the  query  against  paragraphs 
of  the  document.  We  use  the  same 
weighting  schemes  as  above. 

We  also  tried  combinations  of  the  above  cate- 
gories: e.g.,  the  best  matching  paragraph  pair 
and  the  best  matching  sentence  pair.  See  Ta- 
ble 3  for  a  complete  list  of  local  values  that 
were  considered. 

In all, we tried 72 combinations of local and
global values, using from one local value to 19
different local values.¹ The LSP-determined
$\alpha$ and $\beta_i$'s of those values are then applied
to a retrieval run on that same set of queries
and documents. The top performing result includes
only a single local value: the best match

¹ There were roughly 1.2 million possible combinations;
we chose 72 that seemed, based on earlier experiments,
likely to succeed.
Run                    R-prec   Total Rel   recall-prec
crnlV2                 0.3640     8018        0.3163
crnlV2-b               0.4053     8256        0.3512
crnlV2-b (no not's)    0.4061     8254        0.3560
crnlL2                 0.3641     8224        0.3258
crnlL2-b               0.3922     8379        0.3538
sentence restricted    0.3960     8252        0.3477

Table 2: Ad-hoc results


of the query against the paragraphs of the candidate
document (III.a.1 from Table 3), with
the  query  terms  weighted  1  iff  present,  and 
the  document  terms  weighted  using  formula  1 
above  (that  used  by  the  query  in  the  global 
similarity). 

We  then  use  the  global/local  values  in  a  se- 
ries of  retrieval  runs  using  the  same  queries 
but  against  the  entire  TREC  1  document  set 
(D12). We tried a range of $\alpha$ and $\beta$ values and
use  the  best  values  for  the  official  run,  crnlL2. 
The  formula  used  for  crnlL2  is: 

$$\mathit{sim} = 100 \cdot \mathit{global} + 16 \cdot \mathit{local}$$

where "global" is the query/document similarity
described above ("ltc-lnc"), and "local" is
the top query/paragraph similarity.

It  takes  roughly  5  hours  clock  time  to  de- 
termine the  suggested  weighting  coefficients, 
though  multiple  combinations  of  values  could 
be  weighted  simultaneously — in  one  case,  we 
calculated  each  of  the  48  possible  local  vari- 
ables simultaneously.  Each  of  the  retrospec- 
tive runs  takes  from  60  to  90  minutes  to  run, 
depending  on  its  complexity.  These  runs  take 
an  unusually  large  amount  of  time  (compared 
to  crnlV2)  since  they  require  re-indexing  from 
scratch  a  large  number  of  documents.  The  ba- 
sic procedure  is  to  discover  the  top  1750  doc- 
uments for  each  query  using  the  global  sim- 
ilarity. Then  each  of  those  documents  is  re- 
indexed,  breaking  it  down  into  its  component 
parts  (e.g.,  paragraphs).  Then  each  compo- 
nent part  is  compared  against  the  query  to 
obtain  local  similarities. 
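
The procedure just described, combined with the crnlL2 formula, can be sketched as follows. The data layout is an assumption for illustration: global similarities are taken as already computed, each document's paragraphs are supplied as already-weighted sparse vectors, and the names `dot` and `rerank` are hypothetical.

```python
def dot(u, v):
    """Inner product of two sparse vectors stored as dictionaries."""
    return sum(w * v[t] for t, w in u.items() if t in v)

def rerank(query_vec, global_scores, paragraph_vectors, top_k=1750,
           alpha=100.0, beta=16.0):
    """Re-rank the top_k documents by alpha*global + beta*best paragraph match.

    global_scores:     {doc_id: global similarity to the query}
    paragraph_vectors: {doc_id: [sparse term vector for each paragraph]}
    """
    top = sorted(global_scores, key=global_scores.get, reverse=True)[:top_k]
    combined = {}
    for doc_id in top:
        # Local similarity: the best-matching paragraph of this document.
        local = max((dot(query_vec, p) for p in paragraph_vectors.get(doc_id, [])),
                    default=0.0)
        combined[doc_id] = alpha * global_scores[doc_id] + beta * local
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)
```

Only documents that survive the global cut-off are ever split into parts, which is why the cost of a run is dominated by re-indexing those top-ranked documents rather than by the final combination.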

Other  Experiments 

The  Smart  indexing  procedures  that  are  used 
in  our  experiments  do  not  analyze  the  docu- 
ments or  queries  for  negative  terms  such  as 


not. A query which explicitly requests documents
"not about the United Kingdom or
Canada"  will  actually  match  any  document 
with  those  terms.  Removing  the  negative  key- 
words results  in  insignificant  improvement:  16 
queries  are  helped,  16  are  hurt,  all  in  only  a  mi- 
nor fashion.  These  results  suggest  that  other 
terms  in  the  query  were  more  important  for 
locating  the  relevant  documents. 

Earlier  experiments  with  an  on-line  ency- 
clopedia ([14,  16])  demonstrated  that  preci- 
sion can  be  improved  by  discarding  docu- 
ments which  fail  a  local  context  check  (cf.  [1] 
where  such  documents  were  merely  given  lower 
similarity  measures).  That  approach  on  the 
TREC  2  queries  and  collection  yields  almost 
exactly  the  same  performance  as  crnlV2-b 
(see  "sentence  restricted"  in  Table  2).  [1]  dis- 
cusses probable  reasons  for  the  limited  success 
of  this  method. 


Analysis  of  Ad-hoc  Results 

The  results  of  Table  2  suggest  that  there  is 
little advantage to using local values in combination
with global matches. From run crnlV2
to run crnlL2 there is negligible improvement,
although crnlL2 does retrieve an additional 120-200
relevant documents.

The  retrospective  runs  using  the  Wall  Street 
Journal  sub-collection  suggested  there  would 
be  greater  improvement  between  crnlV2  and 
crnlL2  than  actually  occurred.  The  most  ob- 
vious problem  is  that  the  definition  of  a  para- 
graph is  sub-collection  dependent.  Our  results 
were  tailored  to  the  WSJ  sub-collection  and 
probably  did  not  apply  well  to  the  other  sub- 
collections  where  "paragraphs"  might  be  ex- 
tremely large. 
Pairs
  I     Query sentences vs. doc sentences
  II    Query paragraphs vs. doc paragraphs
  III   Entire query vs. doc paragraphs

Which
  a     Top matching pair
  b     Non-zero matching pairs
  c     Pairs where similarity exceeds threshold
  d     All pairs

Value
  1     Similarity (avg)
  2     Number of common terms (avg)
  3     Top matching term (avg)
  4     Count of pairs

Table 3: Local values considered for LSP weighting
(all combinations, choosing one from each category)


Future  work 

We  are  currently  investigating  the  use  of  re- 
gression analysis  to  find  correlation  between 
relevance  and  local  similarity  values.  Using 
such  analysis  will  allow  the  local  values  to  be 
selected  for  cause  rather  than  solely  because  of 
experience  and  intuition.  If  successful,  it  will 
also  provide  a  collection-independent  method 
of  selecting  which  local  values  are  useful.  Note 
that  this  approach  does  require  a  training  set 
of  queries  and  relevance  judgements. 

We  are  interested  in  applying  these  tech- 
niques to  the  TREC  collections  with  a  more 
useful  definition  of  "paragraph."  [17,  2]  sug- 
gest the  possibility  of  narrowing  the  search 
window  to  fixed-size  pieces,  ignoring  para- 
graph boundaries.  Hearst's  "TextTiling"  ap- 
proach ([7])  is  intriguing  for  the  topic-coherent 
units  of  text  it  produces. 

Routing 

In  this  work,  routing  queries  are  formed  in 
two  distinct  phases.  In  the  first  phase,  con- 
cepts which  occur  often  in  relevant  documents 
are  added  to  the  original  query  to  expand  the 
vocabulary  used.  In  the  second  phase,  the 
original  concepts  plus  the  added  concepts  are 
weighted  based  upon  their  occurrences  in  rel- 
evant and  non-relevant  documents. 

In  TREC  1,  query  expansion  was  a  major 
obstacle.  It  was  clear  that  only  very  limited 
expansion  was  useful,  and  indeed  the  best  au- 
tomatic routing  run  ([5])  used  no  expansion  at 


all.  Thus  the  original  plans  for  TREC  2  rout- 
ing included  extensive  investigation  into  very 
selectively  adding  concepts  to  queries. 

However,  as  work  on  TREC  2  progressed 
it  became  obvious  that  the  TREC  1  results 
were  somewhat  anomalous.  For  the  routing 
approaches  used  in  this  work,  selectivity  of 
added  terms  is  not  an  issue.  Rather,  the  more 
terms  that  are  added,  the  better  the  result — 
up  to  a  point  of  diminishing  returns.  This  re- 
sult agrees  with  our  experiences  on  the  (small) 
feedback  test  collections  that  we  have  worked 
with  in  the  past.  The  original  TREC  1  train- 
ing data  for  routing  was  extremely  sketchy 
and  the  resulting  unusual  query  expansion  re- 
sults were  probably  due  to  the  lack  of  infor- 
mation about  what  a  representative  relevant 
document  looked  like. 

The  basic  routing  approach  chosen  is  the 
feedback  approach  of  Rocchio  ([9,  13]).  Ex- 
pressed in  vector  space  terms,  the  final  query 
vector  is  the  initial  query  vector  moved  to- 
ward the  centroid  of  the  relevant  documents, 
and  away  from  the  centroid  of  the  non-relevant 
documents. 

$$Q_{\text{new}} = A \cdot Q_{\text{old}} + B \cdot \text{average\_wt\_in\_rel\_docs} - C \cdot \text{average\_wt\_nonrel\_docs}$$

Terms  that  end  up  with  negative  weights  are 
dropped  (less  than  3%  of  terms  were  dropped 
in  the  most  massive  query  expansion  below). 
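
A minimal sketch of this update on sparse dictionary vectors. The function names are hypothetical, and the default values of `a`, `b`, and `c` are simply the settings reported for the official run later in this paper, used here as convenient defaults.

```python
from collections import defaultdict

def centroid(vectors):
    """Average weight of each term over a list of sparse vectors."""
    total = defaultdict(float)
    for vec in vectors:
        for term, weight in vec.items():
            total[term] += weight
    count = len(vectors)
    return {t: w / count for t, w in total.items()} if count else {}

def rocchio(query, rel_docs, nonrel_docs, a=8.0, b=16.0, c=4.0):
    """Move the query toward the relevant centroid and away from the
    non-relevant centroid; terms without a positive weight are dropped."""
    rel, nonrel = centroid(rel_docs), centroid(nonrel_docs)
    new_query = {}
    for term in set(query) | set(rel) | set(nonrel):
        weight = (a * query.get(term, 0.0)
                  + b * rel.get(term, 0.0)
                  - c * nonrel.get(term, 0.0))
        if weight > 0.0:
            new_query[term] = weight
    return new_query
```

In this sketch every term of every judged document is a candidate; restricting the result to a chosen number of added terms is a separate selection step.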

The parameters of Rocchio's method are the
relative importance of the original query, the
relevant documents, and the non-relevant documents
(A, B, C above); and then exactly what
terms are to be considered part of the final vector.

In  TREC  1,  a  similar  approach  originally 
proposed  by  Ide  was  used. [8,  13]  That  seemed 
to  work  well  for  the  fragmentary  relevance  in- 
formation available  for  TREC  1.  TREC  2 
has  considerably  more  information  available — 
and  of  much  higher  quality — so  Rocchio's  ap- 
proach is  more  appropriate. 

The best parameter values to use for Rocchio's
algorithm were investigated by splitting
the TREC 1 collection into two parts along
the natural D1 (TREC 1 initial collection) and
D2 (TREC 1 routing collection) lines. D1
formed the learning set and D2 the evaluation
set for a large number of experimental runs
determining these parameters. The original
TREC 1 routing queries (Q2) are expanded
and weighted using Rocchio's algorithm with
the relevance information from D1. They are
then evaluated by running them against D2
and using the known Q2-D2 relevance information.

Queries  are  expanded  by  adding  the  "best" 
X  single  terms  and  the  "best"  Y  phrases  to 
the  original  query.  We  used  a  simple  notion  of 
"best"  for  TREC  2:  terms  that  occurred  in  the 
most  relevant  documents  (ties  were  broken  by 
considering  the  highest  average  weight  in  the 
relevant  documents). 
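
This notion of "best" can be sketched as a sort on two keys. The sketch assumes that terms already in the query are skipped (since they need not be added) and that the tie-breaking average is taken over the relevant documents containing the term; the function name is hypothetical.

```python
def best_expansion_terms(rel_docs, query_terms, how_many):
    """Rank candidate terms by the number of relevant documents containing
    them, breaking ties by average weight in those documents."""
    stats = {}  # term -> (document count, summed weight)
    for vec in rel_docs:
        for term, weight in vec.items():
            if term in query_terms:
                continue  # already part of the original query
            count, total = stats.get(term, (0, 0.0))
            stats[term] = (count + 1, total + weight)
    ranked = sorted(stats,
                    key=lambda t: (stats[t][0], stats[t][1] / stats[t][0]),
                    reverse=True)
    return ranked[:how_many]
```

The same function serves for single terms and for phrases: calling it once with single-term vectors and `how_many = X`, and once with phrase vectors and `how_many = Y`, yields the two expansion lists.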

There  is  a  core  set  of  158  runs  using  differ- 
ent parameter  values  for  both  expansion  and 
weighting. Table 4 gives the six parameter
possibilities. The trends noticeable in this
investigatory set of runs are:

1.  Overall  effectiveness  increases  strongly  as 
the  number  of  terms  added  increases,  up 
until  200  terms  at  which  point  it  starts  to 
level  off. 

2. Phrases are reasonably important (6%
difference) at low single term expansion
numbers, but become less important at
higher values (1% difference).

3.  As  expected,  weights  in  relevant  doc- 
uments are  far  more  important  than 
weights  in  non-relevant  documents. 

The parameters of our official run, crnlR1, are:
adding X = 300 single terms, adding Y = 50
phrases, importance of original query of A = 8,
importance of weight in relevant documents of
B = 16, importance of weight in non-relevant
documents of C = 4, and relative importance
of phrases at retrieval time of P = 0.5.

Query-by-Query  Parameter  Esti- 
mation 

We examined the results for the 158 test
routing runs in more detail, query by query. For
each of the 50 queries, we found the best test
run. The results (see Table 5) show some
interesting patterns not brought out by the
overall averages. Not surprisingly, the parameters
used for crnlR1 are not best for any single
query; they are just a reasonable compromise.
There seem to be two main groups of queries:
one in which very limited expansion is useful
(including 6 queries where no expansion is
preferred); and one in which the more terms are
added, the better (23 queries with expansion
of 500 single terms). If massive expansion is
useful, in general the original query is less
important than the expanded terms: A is much
less than B. There is a separate distinction
between those queries where phrases are
useful and those where phrases appear useless:
1 query worked best adding 100 phrases,
6 with 50 added, 2 with 10, 16 using the
original phrases only, and 25 using no phrases at
all.

If we retrospectively choose the best
parameters for each query (something that cannot be
done in practice), then we achieve roughly a
10% improvement. This is substantial enough
to justify trying a predictive run, so our
second official run (crnlC1) uses query-by-query
choice of parameter values in a predictive (as
opposed to retrospective) fashion. The values
given in Table 5 were used.
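
The retrospective choice amounts to picking, for every query, the test run with the highest score on that query. A minimal sketch, assuming the per-query scores are held in a nested dictionary (a hypothetical layout, not the format actually used):

```python
def best_run_per_query(scores):
    """Return, for each query, the best-scoring run and its score.

    scores: dict run_id -> dict query_id -> effectiveness value
            (hypothetical layout; any per-query measure can be used).
    """
    best = {}
    for run_id, per_query in scores.items():
        for query_id, value in per_query.items():
            if query_id not in best or value > best[query_id][1]:
                best[query_id] = (run_id, value)
    return best
```

In the predictive setting the parameter settings chosen this way on the training collections are simply reused for the new documents, so the gain is expected to be smaller than the retrospective figure.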


Routing  Results 

Both crnlR1 and crnlC1 do extremely well in
comparison with other TREC 2 routing runs:

Run       Best    > median    < median
crnlR1      7        40           3
crnlC1      5        45           0

Evaluation measures in Table 6 for both the
official and some non-official runs show the
importance of query expansion. Run 1 is the
base case: original query only (ltc weights).
X  number  of  single  terms  to  add  (possible  values  0  to  500) 

Y  number  of  phrases  to  add  (0  to  100) 

A  relative  importance  of  original  query  (fixed  at  8) 

B  relative  importance  of  average  weight  in  relevant  documents  (4  to  48) 

C  relative  importance  of  average  weight  in  non-relevant  documents  (0  to  16) 

P  relative  importance  of  phrases  in  final  retrieval  as  compared  to  single  terms  (0,  0.5,  or  1.0) 


Table  4:  Parameters  of  routing 


Just re-weighting the query terms according
to Rocchio's algorithm gives a 7% improvement.
Adding a few terms (20 single terms
+ 10 phrases) gives 17% improvement over
the base case, and expanding by 350 (300 + 50)
terms results in a 38% improvement.
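
The quoted percentages can be checked with simple arithmetic. The sketch below assumes they are computed from the last numeric column of Table 6 (an inference from the numbers themselves, since the column headings are not legible); the values are used as printed, without a decimal point.

```python
def pct_improvement(new, base):
    """Relative improvement of `new` over `base`, in percent."""
    return 100.0 * (new - base) / base

# Last numeric column of Table 6, as printed (assumed to be the measure
# behind the percentages quoted in the text).
base_case = 2869        # run 1: original query only
reweighted = 3087       # run 2: Rocchio re-weighting, no expansion
little_expand = 3345    # run 3: 20 single terms + 10 phrases
official_run = 3952     # run 4: 300 single terms + 50 phrases

for label, value in [("re-weighting", reweighted),
                     ("little expansion", little_expand),
                     ("massive expansion", official_run)]:
    print(f"{label}: {pct_improvement(value, base_case):.1f}%")
```

The three results fall between 7% and 8%, between 16% and 17%, and between 37% and 38% respectively, consistent with the rounded figures in the text.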

The official run crnlC1 is actually a bit
disappointing. It only results in a 3% improvement
over the crnlR1 run, which is not very
significant considering the effort required. Few
people are going to keep track of 158 test runs
on a per query basis. It may be practical to
keep track of 4 or so main query variants, but
then the improvement would probably be less
than 3%. We are conducting experiments in
this area currently.

An  open  question  is  the  effectiveness  of 
varying  the  feedback  approach  itself  between 
queries.  Preliminary  experiments  using  Fuhr's 
RPI  ([3])  weighting  schemes  in  addition  to  the 
Rocchio  variants  show  larger  improvements. 
In  general,  RPI  (and  the  other  probabilistic 
models)  perform  noticeably  better  than  Roc- 
chio if  there  is  very  little  query  expansion, 
though  quite  a  bit  worse  under  massive  ex- 
pansion. We  expect  that  the  combination  of 
RPI  for  those  queries  with  little  expansion  and 
Rocchio  for  other  queries  will  work  well. 

One benefit of the crnlC1 run not entirely
represented by the evaluation figures is that
retrieval performance is more even. Potential
mismatches between feedback method and
query are far less likely. crnlC1 does reasonably
well on all the queries (above the median
system for every query when compared against
the other systems).

Routing  Implementation  and 
Timing 

The original routing queries are automatically
indexed from the query text, and weighted
using the "ltc" weighting scheme (equation (1)).
Collection frequency information used for the
idf factors is gathered from D12 documents
only. Relevance information about potential
query terms is gathered and stored on a per
query basis. For each query, statistics (including
relevant and non-relevant frequency and
total "ltc" weights) are kept about the 1000
most frequently occurring terms in the D12
relevant documents. For TREC 2, this is done
by a batch run taking about 90 CPU minutes.
In practice, this would be done incrementally
as each document was compared to the query
and judged. The statistics amounted to about
40,000 bytes per query.

Using  these  statistics,  and  the  decided  upon 
parameters  for  the  feedback  process  (A,  B, 
etc.),  actual  construction  of  the  final  query 
takes  about  0.5  seconds  per  query. 

Retrieval times vary tremendously with
length of query. We ran in batch mode,
constructing an inverted file for the entire D3
document set ("lnc" document weights) and
then comparing each query against that inverted
file. This is not what would be done in
practice, and given our massive expansion of
queries it is much less efficient than a practical
approach would be: for each query in crnlR1,
well over half the entire inverted file was read!
CPU time per query ranged from about 5
seconds (no expansion) to 65 seconds (expansion
by 500 terms).

Conclusion 

No  firm  conclusions  can  be  reached  regarding 
the  usefulness  of  combining  local  and  global 
similarities  in  the  TREC  environment.  In 
some  limited  circumstances  minor  improve- 
ments can  be  obtained,  but  in  general  we  have 
not  (yet!)  been  able  to  take  advantage  of  the 
local  information  we  know  should  be  useful. 
Query    P      X      Y     A     B      C
...     ...    ...     0     8     16?    4
66      1.0    200     0?    8     16?    4
67      0.0    500     -     8     16?    4
68      1.0    300     0     8     48     4
69      1.0    500     0     8     24     4
70      1.0    500     50    8     48     4
71      0.0    500     -     8     8      4
72      1.0    300     0     8     48     4
73      0.0    500     -     8     16     4
74      1.0    100     100   8     8      4
75      0.0    100     -     8     24     6
76      1.0    300     0     8     24     4
77      0.0    500     -     8     8      4
78      0.0    0       -     8     24     4
79      0.0    500     -     8     24     4
80      1.0    500     0     8     24     4
81      1.0    500     0     8     32     4
82      1.0    500     0     8     32     4
83      0.0    500     -     8     94?    4
84      0.0    100     -     8     36     4
85      1.0    100     10    8     24     4
86      0.0    30      -     8     24     4
87      1.0    500     0     8     48     4
88      1.0    ?       50    8     48     4
89      1.0    500     0     8     48     4
90      0.0    500     -     8     16     4
91      1.0    0       0     8     24     4
92      0.0    100     -     8     36     4
93      1.0    100     50    8     8      4
94      0.0    200     -     8     36     4
95      0.0    200     -     8     16     8
96      1.0    300     50    8     48     4
97      0.0    200     -     8     36     4
98      0.0    500     -     8     24     4
99      0.0    50      -     8     24     4
100     1.0    500     50    8     48     4

Table 5: Optimum routing parameters, query-by-query
     Run              X.Y       A.B.C     Total RpI
1.   no fdbk          0.0       8.0.0     3382    6509    2869
2.   no expand        0.0       8.8.4     3531    6849    3087
3.   little expand    20.10     8.8.4     3756    7192    3345
4.   crnlR1           300.50    8.16.4    4273    7764    3952
5.   crnlC1           varies    varies    4367    7808    4091

Table 6: Routing evaluation


For  TREC  2,  this  failure  is  ameliorated  by  the 
base  level  performance  of  the  global  run.  If 
the  correct  weights  are  used,  the  effectiveness 
of  automatic  indexing  is  extremely  good. 

Automatic  massive  query  expansion  proves 
to  be  very  effective  for  routing.  Conven- 
tional relevance  feedback  techniques  are  used 
to  weight  the  expanded  queries.  Parameters 
for  the  relevance  feedback  algorithms  are  esti- 
mated both  over  all  the  queries  and  for  each 
query individually. The individual query
estimation performs better (3-4%) but by an
insufficient amount to be convincing.
ABSTRACT 

The  experiments  described  here  are  part  of  a  research 
program  whose  objective  is  to  develop  a  full-text 
retrieval  methodology  that  is  statistically  sound  and 
powerful,  yet  reasonably  simple.  The  methodology  is 
based  on  the  use  of  a  probabilistic  model  whose 
parameters  are  fitted  empirically  to  a  learning  set  of 
relevance  judgements  by  logistic  regression.  The 
method  was  applied  to  the  TIPSTER  data  with  opti- 
mally relativized  frequencies  of  occurrence  of  match 
stems  as  the  regression  variables.  In  a  routing 
retrieval  experiment,  these  were  supplemented  by 
other  variables  corresponding  to  sums  of  logodds  asso- 
ciated with  particular  match  stems. 

Introduction 

The  full-text  retrieval  design  problem  is  largely  a 
problem  in  the  combination  of  statistical  evidence.  With 
this  as  its  premise,  the  Berkeley  group  has  concentrated  on 
the  challenge  of  finding  a  statistical  methodology  for  com- 
bining retrieval  clues  in  as  powerful  a  way  as  possible, 
consistent  with  reasonable  analytic  and  computational 
simplicity.  Thus  our  research  focus  has  been  on  the  gen- 
eral logic  of  how  to  combine  clues,  with  no  attempt  made 
at  this  stage  to  exploit  as  many  clues  as  possible.  We  feel 
that  if  a  straightforward  statistical  methodology  can  be 
found  that  extracts  a  maximum  of  retrieval  power  from  a 
few  good  clues,  and  the  methodology  is  clearly  hospitable 
to  the  introduction  of  further  clues  in  future,  progress  will 
have  been  made. 

We  join  Fuhr  and  Buckley  (1991,  1992)  in  thinking 
that  an  especially  promising  path  to  such  a  methodology  is 
to  combine  a  probabilistic  retrieval  model  with  the  tech- 
niques of  statistical  regression.  Under  this  approach  a 


probabilistic  model  is  used  to  deduce  the  general  form  that 
the  document-ranking  equation  should  take,  after  which 
regression  analysis  is  applied  to  obtain  empirically-based 
values  for  the  constants  that  appear  in  the  equation.  In 
this  way  the  probabilistic  theory  is  made  to  constrain  the 
universe of logically possible retrieval rules that could be
chosen,  and  the  regression  techniques  complete  the  choice 
by  optimizing  the  model's  fit  to  the  learning  data. 

The  probabilistic  model  adopted  by  the  Berkeley 
group  is  derived  from  a  statistical  assumption  of  'linked 
dependence'.  This  assumption  is  weaker  than  the  historic 
independence assumptions usually discussed. In its
simplest form the Berkeley model also differs from most
traditional models in that it is of 'Type 0' - meaning that the
analysis is carried out w.r.t. sets of query-document pairs
rather  than  w.r.t.  particular  queries  or  particular  docu- 
ments. (For  a  fuller  explanation  of  this  typology  see 
Robertson,  Maron  &  Cooper  1982.)  But  when  relevance 
judgement  data  specialized  to  the  currently  submitted 
search  query  is  available,  say  in  the  form  of  relevance 
feedback  or  routing  history  data,  the  model  is  flexible 
enough  to  accommodate  it  (resulting  in  'Type  2'  retrieval.) 

Logistic  regression  (see  e.g.  Hosmer  &  Lemeshow 
(1989))  is  the  most  appropriate  type  of  regression  for  this 
kind of IR prediction. Although standard multiple
regression analysis has been used successfully by others in
comparable circumstances (Fuhr & Buckley op. cit.), we
believe logistic regression to be logically more
appropriate for reasons set forth elsewhere (Cooper, Dabney &
Gey 1992). Logistic regression, which accepts binary
training  data  and  yields  probability  estimates  in  the  form 
of  logodds  values,  goes  hand-in-glove  with  a  probabilistic 
IR  model  that  is  to  be  fitted  to  binary  relevance  judgement 
data  and  whose  predictor  variables  are  themselves 
logodds. 
The  Fundamental  Equation 

As  the  starting  point  of  our  probabilistic  retrieval 
model  we  adopt  (with  reservations  to  be  explained 
presently)  a  statistical  simplifying  assumption  called  the 
'Assumption  of  Linked  Dependence'  (Cooper  1991). 
Intuitively, it states that in the universe of all
query-document pairs, the degree of statistical dependency that
exists between properties of pairs in the subset of all
relevance-related pairs is linked in a certain way with the
degree of dependency that exists between the same
pair-properties in the subset of all nonrelevance-related pairs.
Under at least one crude measure of degree of dependence
(specifically, the ratio of the joint probability to the
product of individual probabilities), the linkage in question is
one of identity. That is, the claim made by the assumption
is that the degree of the dependency is the same in both
sets.

For N pair-properties A_1, ..., A_N, the mathematical
statement of the assumption is as follows.

ASSUMPTION OF LINKED DEPENDENCE:

\[
\frac{P(A_1, \ldots, A_N \mid R)}{P(A_1, \ldots, A_N \mid \bar{R})}
= \frac{P(A_1 \mid R)}{P(A_1 \mid \bar{R})} \times \cdots \times \frac{P(A_N \mid R)}{P(A_N \mid \bar{R})}
\]

If one thinks of a query and document drawn at random, R
is the event that the document is relevant to the query,
while each clue A_i is some respect in which the document
is similar or dissimilar to, or somehow related to, the
query.

For  most  clue-types  likely  to  be  of  interest  this  sim- 
plifying assumption  is  at  best  only  approximately  true. 
Though  not  as  implausible  as  the  strong  conditional  inde- 
pendence assumptions  traditionally  discussed  in  proba- 
bilistic IR,  still  it  should  be  regarded  with  skepticism.  In 
practice,  when  the  number  N  of  clues  to  be  combined 
becomes large the assumption can become highly
unrealistic, with a distorting tendency that often causes the
probability of relevance estimates produced by the resulting
retrieval  model  to  be  grossly  overstated  for  documents 
near  the  top  of  the  output  ranking.  But  it  will  be  expedi- 
ent here  to  adopt  the  assumption  anyway,  on  the  under- 
standing that  later  on  we  shall  have  to  find  a  way  to  curb 
its  probability-inflating  tendencies. 

From  the  Assumption  of  Linked  Dependence  it  is 
possible  to  derive  the  basic  equation  underlying  the  proba- 
bilistic model: 


Equation (1):

\[
\log O(R \mid A_1, \ldots, A_N) = \log O(R) + \sum_{i=1}^{N} \left[ \log O(R \mid A_i) - \log O(R) \right]
\]

where for any events E and E' the odds O(E | E') is by
definition P(E | E') / P(Ē | E'). Using this equation, the
evidence of separate clues can be combined as shown on the
right side to yield the logodds of relevance based on all clues,
shown on the left. Query-document pairs can be ranked
by  this  logodds  estimate,  and  a  ranking  of  documents  for  a 
particular  query  by  logodds  is  of  course  a  probability 
ranking  in  the  IR  sense.  For  further  discussion  of  the 
linked  dependence  assumption  and  a  formal  derivation  of 
Eq.  (1)  from  it,  see  Cooper  (1991)  and  Cooper,  Dabney  & 
Gey  (1992).  An  empirical  investigation  of  independence 
and  dependence  assumptions  in  IR  is  reported  by  Gey 
(1993). 
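
As a small numerical illustration of Eq. (1), the sketch below combines per-clue probabilities of relevance into a single logodds value. The function names are hypothetical, natural logarithms are assumed, and the probabilities in the usage note are invented purely for illustration.

```python
import math

def logodds(p):
    """Logodds (natural log) of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

def probability(lo):
    """Inverse of logodds: convert a logodds value back to a probability."""
    return 1.0 / (1.0 + math.exp(-lo))

def combine_clues(prior_prob, clue_probs):
    """Eq. (1): log O(R | A_1..A_N) from P(R) and the values P(R | A_i)."""
    lo_prior = logodds(prior_prob)
    return lo_prior + sum(logodds(p) - lo_prior for p in clue_probs)
```

With an invented prior of 0.1 and three clues that each raise the probability of relevance to 0.5, `probability(combine_clues(0.1, [0.5, 0.5, 0.5]))` is 81/82, about 0.99. This shows how quickly the combined estimate grows with the number of clues, which is the probability-inflating tendency mentioned above.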

Application  to  Term  Properties 

We shall consider as a first application of the
modeling equation the problem of how to exploit the properties
of match terms. By a 'match' term we mean a stem (or
word, phrase, etc.) that occurs in the query and also, in
identical or related form, in the document to which the
query is being compared. The retrieval properties of a
match term could include its frequency characteristics in
the query, in the document, or in the collection as a whole,
its grammatical characteristics, the type or degree of
'match' involved, etc. If there are M match terms that
relate a certain query to a certain document, Eq. (1)
becomes applicable with N set to M. Each of the
properties A_1, ..., A_M then represents a composite clue or set of
properties concerning one of the query-document pair's
match terms.

Suppose  a  'learning  set'  of  human  judgements  of 
relevance  or  nonrelevance  is  available  for  a  sample  of  rep- 
resentative query-document  pairs.  (However,  for  the  time 
being  we  assume  no  such  learning  data  is  available  for  the 
very  queries  on  which  the  retrieval  is  to  be  performed.) 
Logistic  regression  can  be  applied  to  the  data  to  develop 
an  equation  capable  of  using  the  match  term  clues  to  esti- 
mate, for  any  query-document  pair,  values  for  each  of  the 
expressions  in  the  right  side  of  Eq.  (1).  Eq.  (1)  then  yields 
a  probability  estimate  for  the  pair. 
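TABLE I: Equations to Approximate the Components of Eq. (1)

\[
\begin{array}{rcl}
\log O(R) & \approx & [\, b_0 + b_1 M \,] \\
\log O(R \mid A_1) - \log O(R) & \approx & [\, a_0 + a_1 X_{1,1} + \cdots + a_K X_{1,K} + a_{K+1} M \,] - [\, b_0 + b_1 M \,] \\
\vdots & & \vdots \\
\log O(R \mid A_M) - \log O(R) & \approx & [\, a_0 + a_1 X_{M,1} + \cdots + a_K X_{M,K} + a_{K+1} M \,] - [\, b_0 + b_1 M \,] \\
\hline
\log O(R \mid A_1, \ldots, A_M) & \approx & [\, a_0 M + a_1 \sum_{m=1}^{M} X_{m,1} + \cdots + a_K \sum_{m=1}^{M} X_{m,K} + a_{K+1} M^2 \,] - (M-1) [\, b_0 + b_1 M \,]
\end{array}
\]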


What  must  be  done,  then,  is  to  find  approximating 
expressions  for  the  various  components  in  the  right  side  of 
Eq.  (1).  Table  I  (above)  shows  the  interrelationships 
among  the  quantities  involved.  In  this  array,  the  material 
to  the  left  of  the  column  of  approximation  signs  just 
restates  Eq.  (1),  for  Eq.  (1)  asserts  that  the  expressions 
above  the  horizontal  line  sum  to  the  expression  below  the 
line.  On  the  right  of  each  approximation  sign  is  a  simple 
formula  that  might  be  used  to  approximate  the  expression 
on  the  left.  (More  complex  formulae  could  of  course  also 
be  considered.)  Each  right-side  formula  involves  a  linear 
combination of K variables X_{m,1}, ..., X_{m,K} corresponding
to the K retrieval properties (frequency values, etc.) used
to characterize the term in question. The idea is to apply
logistic regression to find the values for the coefficients
(the a's and b's) in the right side that produce the best
regression fit to the learning sample. Once these constants
have been determined, retrieval can be carried out by
summing the right sides to obtain the desired estimate of
log O(R | A_1, ..., A_M) for any given query-document
pair.

It may not be immediately apparent why terms in M
have been included in the approximating formulas on the
right. In Eq. (1), the number N of available retrieval clues
is actually part of the conditionalizing information, a fact
that could have been stressed by writing O(R | N) in place
of O(R) and O(R | A_i, N) in place of O(R | A_i). So the
approximating formula for log O(R) is really an
approximation for log O(R | M) and must involve a term in M.
(Simple functions of M such as √M or log(M) might also
be considered.) Similar remarks apply to the formulas for
approximating log O(R | A_m).

There  are  two  approaches  to  the  problem  of  finding 
coefficients  to  use  in  the  approximating  expressions.  The 
first  is  what  might  be  called  the  'triples-then-pairs' 
approach.  It  starts  out  with  a  logistic  regression  analysis 
performed  directly  on  a  sample  of  query-document-term 
triples  constructed  from  the  learning  set.  In  the  sample, 
each  triple  is  accompanied  by  the  clue-values  associated 
with  the  match  term  in  the  query  and  document,  and  by 
the human relevance judgement for the query and
document. The clue-values are used as the values of the
independent variables in the regression, and the relevance
judgements as the values of the dependent variable.
After the desired coefficients have been obtained, the
resulting regression equation can be applied to evaluate
the left-side expressions, which can in turn be summed
down the M terms to obtain a preliminary estimate of
log O(R | A_1, ..., A_M).

A  second  regression  analysis  is  then  performed  on  a 
sample of query-document pairs, using the preliminary
logodds  prediction  from  the  first  (triples)  analysis  as  an 
independent  variable.  This  second  analysis  serves  the  pur- 
pose of  correcting  for  distortions  introduced  by  the 
Assumption  of  Linked  Dependence.  It  also  allows 
retrieval  properties  not  associated  with  particular  match 
terms,  such  as  document  length,  age,  etc.,  to  be  intro- 
duced. The  triples-then-pairs  approach  is  the  one  used  by 
the  Berkeley  group  in  its  Trec-1  research  (Cooper,  Gey  & 
Chen  1992).  The  theory  behind  it  is  presented  in  more 
detail  in  (Cooper,  Dabney  &  Gey  1992). 




The second way of determining the coefficients -
the one used in Trec-2 - is the 'pairs-only' approach. It
requires only one regression analysis, performed on a
sample of query-document pairs. It is based on the trivial
observation that in the right side of the above array,
instead of adding across rows and then down the resulting
column of sums, one can equivalently add down columns
and across the resulting row of sums. Under either
procedure the grand total value for log O(R | A_1, ..., A_M) will
be the same.

Summing down the columns and then across the
totals gives the expression shown in the bottom line of the
array. It simplifies to

\[
\log O(R \mid A_1, \ldots, A_M) \approx
b_0 + a_1 \sum_{m=1}^{M} X_{m,1} + \cdots + a_K \sum_{m=1}^{M} X_{m,K}
+ (a_0 - b_0 + b_1) M + (a_{K+1} - b_1) M^2
\]

Since there is no need to keep the a_i coefficients
segregated from the b_i coefficients to get a predictive equation,
this suggests the adoption of a regression equation of form

\[
\log O(R \mid A_1, \ldots, A_M) =
c_0 + c_1 \sum_{m=1}^{M} X_{m,1} + \cdots + c_K \sum_{m=1}^{M} X_{m,K}
+ c_{K+1} M + c_{K+2} M^2
\]

The coefficients c_0, ..., c_{K+2} may be found by a logistic
regression on a sample of query-document pairs
constructed from the learning sample. In the sample each pair
is accompanied by its K different X_{m,k}-values, each
already summed over all match terms for the pair, the
values of M and M^2, and (to serve as the dependent variable
in the regression) the human relevance judgement for the
pair.
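
The equivalence of the two summation orders, and the form of the per-pair regression variables, can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, `X` holds the K clue values for each of the M match terms, `a` holds a_0, a_1, ..., a_K, a_{K+1} in that order, and `b` holds b_0, b_1.

```python
def logodds_rowwise(X, a, b):
    """Add across each row of Table I, then down the column of row sums."""
    M = len(X)
    prior = b[0] + b[1] * M                      # approximates log O(R)
    total = prior
    for row in X:
        term = a[0] + sum(a[k + 1] * x for k, x in enumerate(row)) + a[-1] * M
        total += term - prior                    # log O(R | A_m) - log O(R)
    return total

def pair_features(X, K):
    """Per-pair regression variables: K summed clue values, then M and M^2."""
    M = len(X)
    sums = [sum(row[k] for row in X) for k in range(K)]
    return sums + [M, M * M]

def logodds_pairs_only(X, a, b):
    """Add down the columns first, using the simplified coefficients."""
    K = len(a) - 2
    feats = pair_features(X, K)
    c0 = b[0]
    coeffs = list(a[1:K + 1]) + [a[0] - b[0] + b[1], a[-1] - b[1]]
    return c0 + sum(c * f for c, f in zip(coeffs, feats))
```

For any choice of coefficients the two functions return the same value (up to floating-point rounding), which is why a single regression on the summed variables suffices.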

But if only one level of regression analysis is to be
performed, where is the correction for the Assumption of
Linked Dependence to take place? That assumption
causes mischief because it creates a tendency for the
predicted logodds of relevance to increase roughly linearly
with the number of match terms, whereas the true increase
is less than linear. This can be corrected by modifying the
variables in such a way that their values rise less rapidly
than the number of match terms as the number of match
terms increases. The variables can, for instance, be
multiplied by some function f(M) that drops gently with
increasing M, say 1/√M or 1/(1 + log M). The exact form of
the function can be decided during the course of the
regression analysis.
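
The effect of the two candidate damping functions can be seen in a short sketch. The function names are hypothetical and natural logarithms are assumed; the point is only that M times f(M) still grows with M, but more slowly than M itself.

```python
import math

def damp_inverse_sqrt(m):
    """f(M) = 1 / sqrt(M)."""
    return 1.0 / math.sqrt(m)

def damp_inverse_log(m):
    """f(M) = 1 / (1 + log M), natural logarithm assumed."""
    return 1.0 / (1.0 + math.log(m))

def damped_growth(damp, max_terms):
    """Values of M * f(M) for M = 1 .. max_terms.

    This is how a sum of M roughly equal clue values would grow
    after damping.
    """
    return [m * damp(m) for m in range(1, max_terms + 1)]
```

With the square-root form, sixteen match terms contribute about as much as four undamped ones.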

With such a damping factor included, the foregoing
regression equation becomes

Equation (2):

\[
\log O(R \mid A_1, \ldots, A_M) =
c_0 + c_1 f(M) \sum_{m=1}^{M} X_{m,1} + \cdots + c_K f(M) \sum_{m=1}^{M} X_{m,K}
+ c_{K+1} M + c_{K+2} M^2
\]

In  our  experiments,  this  simple  modification  was  found  to 
improve  the  regression  fit  and  the  precision/recall  perfor- 
mance. It  would  appear  therefore  to  be  a  worth-while 
refinement  of  the  basic  model.  Note,  however,  that  this 
adjustment  only  removes  a  general  bias.  It  does  nothing 
to  exploit  the  possibility  of  measuring  dependencies 
between  particular  stems  to  improve  retrieval  effective- 
ness. To  exploit  individual  dependencies  would  be  desir- 
able in  principle,  but  would  require  a  substantial  elabora- 
tion of  the  model  for  what  might  well  turn  out  to  be  an 
insignificant  improvement  in  effectiveness  (for  discussion 
see  Cooper  (1991)). 

Optimally Relativized Frequencies

The  philosophy  of  the  project  called  for  the  use  of  a 
few  well-chosen  retrieval  clues.  The  most  obvious  clues 
to  be  exploited  in  connection  with  match  terms  are  their 
frequencies  of  occurrence  in  the  query  and  document. 
What  is  not  so  obvious  is  the  exact  form  the  frequencies 
should  take.  For  instance,  should  they  be  absolute  or  rela- 
tive frequencies,  or  something  in  between? 

The  IR  literature  mentions  two  ways  in  which  fre- 
quencies might  be  relativized  for  use  in  retrieval.  The  first 
is  to  divide  the  absolute  frequency  of  occurrence  of  the 
term  in  the  query  or  document  by  the  length  of  the  query 
or  document,  or  some  parameter  closely  associated  with 
length.  The  second  is  to  divide  the  relative  frequency  so 
obtained  by  the  relative  frequency  of  occurrence  of  the 
term  in  the  entire  collection  considered  as  one  long  run- 
ning text.  Both  kinds  of  relativization  seem  potentially 
beneficial,  but  the  question  remains  whether  these  rela- 
tivizations,  if  they  are  indeed  helpful,  should  be  carried 
out  in  full  strength,  or  whether  some  sort  of  blend  of  abso- 
lute and  relative  frequencies  might  serve  better. 

To  answer  this  question,  regression  techniques  were 
used  in  a  side  investigation  to  discover  the  optimal  extent 
to which each of the two kinds of relativization might best
be introduced. To investigate the first kind of
relativization - relativization with respect to query or document
length - a semi-relativized stem frequency variable of the
form

\[
\frac{\text{absolute frequency}}{\text{document length} + C}
\]

was  adopted.  If  the  constant  C  is  chosen  to  be  zero,  one 
has  full  relativization,  whereas  a  value  for  C  much  greater 
than  the  document  length  will  cause  the  variable  to  behave 
in  the  regression  analysis  as  though  it  were  an  absolute 
frequency.  Several  logistic  regressions  were  run  to  dis- 
cover by  trial-and-error  the  value  of  C  that  produced  the 
best  regression  fit  to  the  TIPSTER  learning  data  and  the 
highest  precision  and  recall.  It  was  found  that  a  value  of 
around  C  =  35  for  queries,  and  C  =  80  for  documents, 
optimized  performance. 
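
A brief sketch shows how the constant C moves the variable between the two extremes. The function name is hypothetical; the stem counts and lengths in the usage note are invented for illustration.

```python
def semi_relativized_frequency(absolute_frequency, length, c):
    """Stem frequency divided by (length + C).

    C = 0 gives the ordinary relative frequency; a C much larger than the
    length makes the value nearly proportional to the absolute frequency.
    """
    return absolute_frequency / (length + c)
```

For example, a stem occurring 4 times in a document of 120 stem occurrences gives 4 / (120 + 80) = 0.02 with the document constant C = 80, compared with a fully relativized value of 4 / 120.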

For  the  relativization  with  respect  to  the  global  rela- 
tive frequency  in  the  collection,  the  relativizing  effect  of 
the  denominator  was  weakened  by  a  slightly  different 
method  -  by  raising  it  to  a  power  less  than  1.0.  For  the 
document  frequencies,  for  instance,  a  variable  of  form 

\[
\frac{\;\dfrac{\text{absolute frequency in doc}}{\text{document length} + 80}\;}{[\,\text{relative frequency in collection}\,]^{D}}
\]

with  D  <  1  was  in  effect  used.  Actually,  it  was  the  loga- 
rithm of  this  expression  that  was  ultimately  adopted, 
which  allowed  the  variable  to  be  broken  up  into  a  differ- 
ence of  two  logarithmic  expressions.  The  optimal  value 
of  the  power  D  was  therefore  obtainable  in  a  single 
regression  as  the  coefficient  of  the  logged  denominator. 
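
The step of taking logarithms can be written out explicitly. In the worked equation below, f, L and r are shorthand introduced here (not the authors' notation) for the absolute frequency in the document, the document length, and the relative frequency in the collection.

```latex
\[
\log \frac{f / (L + 80)}{r^{D}}
  \;=\; \log \frac{f}{L + 80} \;-\; D \, \log r
\]
```

If the regression assigns a coefficient c to the first logarithm and a separate coefficient to log r, the second coefficient plays the role of -cD, so D can be read off as the ratio of the two magnitudes. Under this reading, the coefficients 0.330 and 0.1937 reported in Eq. (3) would correspond to D of roughly 0.59, although the text does not state the value of D explicitly.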

Thus the variables ultimately adopted consisted in
essence of sums over two optimally relativized
frequencies (ORF's) - one for match stem frequency in the
query and one for match stem frequency in the document.
Because of the logging this breaks up mathematically into
three variables. A final logistic regression using sums
over these variables as indicated in Eq. (2) produced the
ranking equation

Equation (3):

\[
\log O(R \mid A_1, \ldots, A_M) \approx
-3.51 + \frac{1}{\sqrt{M + 1}} \, \Phi + 0.0929 \, M
\]

where \Phi is the expression

\[
\Phi = 37.4 \sum_{m=1}^{M} X_{m,1} + 0.330 \sum_{m=1}^{M} X_{m,2} - 0.1937 \sum_{m=1}^{M} X_{m,3}
\]

Here

Here 

X_{m,1} = number of times the m'th stem occurs in the
query, divided by (total number of all stem
occurrences in query + 35);

X_{m,2} = number of times the m'th stem occurs in the
document, divided by (total number of all stem
occurrences in document + 80), quotient logged;

X_{m,3} = number of times the m'th stem occurs in the
collection, divided by the total number of all stem
occurrences in the collection, quotient logged;

M = number of distinct stems common to both
query and document.

Although Eq. (2) calls for an M^2 term as well, such a term
was found not to make a statistically significant
contribution and so was eliminated.
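
For concreteness, Eq. (3) can be sketched in code as follows. This is an illustration under stated assumptions rather than the authors' implementation: the function name and dictionary layout are hypothetical, natural logarithms are assumed for the "logged" quotients, and the damping factor is taken to be 1/sqrt(M + 1), which is this sketch's reading of the printed equation.

```python
import math

def eq3_logodds(query_tf, doc_tf, coll_tf, coll_total):
    """Estimated logodds of relevance for one query-document pair.

    query_tf, doc_tf: dict stem -> number of occurrences
    coll_tf:          dict stem -> number of occurrences in the collection
    coll_total:       total number of stem occurrences in the collection
    """
    q_len = sum(query_tf.values())
    d_len = sum(doc_tf.values())
    matches = [s for s in query_tf if s in doc_tf]   # distinct match stems
    M = len(matches)

    x1 = sum(query_tf[s] / (q_len + 35) for s in matches)
    x2 = sum(math.log(doc_tf[s] / (d_len + 80)) for s in matches)
    x3 = sum(math.log(coll_tf[s] / coll_total) for s in matches)

    phi = 37.4 * x1 + 0.330 * x2 - 0.1937 * x3
    # Damping factor 1/sqrt(M + 1) is an assumption of this sketch.
    return -3.51 + phi / math.sqrt(M + 1) + 0.0929 * M

def to_probability(lo):
    """Convert a logodds estimate to a probability of relevance."""
    return 1.0 / (1.0 + math.exp(-lo))
```

Documents for a query would be ranked by the returned value. A pair with no match stems receives the constant -3.51, and more occurrences of a match stem in a document of the same length raise the estimate.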

Eq.  (3)  provided  the  ranking  rule  used  in  the  ad  hoc 
run  (labeled  'Brkly3')  and  the  first  routing  run  ('Brkly4') 
submitted for Trec-2. The equation is notable for the
sparsity of the information it uses. Essentially it exploits only
two ORF values for each match stem, one for the stem's
frequency in the query and the other for its frequency in
the document. Other variables were tried including the
inverse document frequency (both logged and unlogged),
a variable consisting of a count of all two-stem phrase
matches, and several variables for measuring the tendency
of the match stems to bunch together in the document. All
of these exhibited predictive power when used in
isolation, but were discarded because in the presence of the
ORF's none produced any detectable improvement in the
regression fit or the precision/recall performance. Some
attempts at query expansion using the WordNet thesaurus
also failed to produce noticeable improvement, even when
care  was  taken  to  create  separate  variables  with  separate 
coefficients  for  the  synonym-match  counts  as  opposed  to 
the  exact-match  counts. 

The  quality  of  a  retrieval  output  ranking  matters 
most  near  the  top  of  the  ranking  where  it  is  likely  to  affect 
the  most  users,  and  matters  hardly  at  all  far  down  the 
ranking  where  hardly  any  users  are  apt  to  search.  Because 
of this it is desirable to adopt a sampling methodology that
produces an especially good regression fit to the sample
data for documents with relatively high predicted
probabilities of relevance. In an attempt to achieve this
emphasis, the regression analysis was carried out in two steps. A
logistic regression was first performed on a stratified
sample of about one-third of the then-available TIPSTER
relevance judgements to obtain a preliminary ranking rule.
This  preliminary  equation  was  then  applied  as  a  screening 
rule  to  the  entire  available  body  of  judged  pairs,  and  for 
each  query  all  but  the  highest-ranked  500  documents  were 
discarded.  The  resulting  set  of  50,000  judged  query- 
document  pairs  (100  topics  x  500  top-ranked  docs  per 
topic)  served  as  the  learning  sample  data  for  the  final 
regression  equation  displayed  above  as  Eq.  (3). 
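
The screening step of this two-stage procedure can be illustrated
with a short Python sketch. It is not the authors' code: the record
layout (dicts with a `query` key), the function name `screen_top_k`
and the `preliminary_score` callable are assumptions made for
illustration, and the two logistic-regression fits themselves are
left out.

```python
from collections import defaultdict


def screen_top_k(judged_pairs, preliminary_score, k=500):
    """Keep, for each query, the k judged pairs ranked highest
    by the preliminary ranking rule (hypothetical helper)."""
    by_query = defaultdict(list)
    for pair in judged_pairs:
        by_query[pair["query"]].append(pair)

    kept = []
    for pairs in by_query.values():
        # Highest preliminary score first, then cut off at k.
        pairs.sort(key=preliminary_score, reverse=True)
        kept.extend(pairs[:k])
    return kept
```

The pairs returned by such a step would then serve as the learning
sample for the second, final regression.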

Application  to  Query-Specific  Learning  Data 

To  this  point  it  has  been  assumed  that  relevance 
judgement  data  is  available  for  a  sample  of  queries  typical 
of  the  future  queries  for  which  retrieval  is  to  be  per- 
formed, but  not  for  those  very  queries  themselves.  We 
consider  next  the  contrasting  situation  in  which  the  learn- 
ing data  that  is  available  is  a  set  of  past  relevance  judge- 
ments for  the  very  queries  for  which  new  documents  are 
to  be  retrieved.  Such  data  is  often  available  for  a  routing 
or  SDI  request  for  which  relevance  feedback  has  been  col- 
lected in  the  past,  or  for  a  retrospective  feedback  search 
that has already been under way long enough for some
feedback  to  have  been  obtained. 

For  a  query  so  equipped  with  its  very  own  learning 
sample,  it  is  possible  to  gather  separate  data  about  each 
individual  term  in  the  query.  Such  data  reflects  the  term's 
special  retrieval  characteristics  in  the  context  of  that 
query. For instance, for a query term $T_i$ we may count the
number $n(T_i, R)$ of documents in the sample that contain
the term and are also relevant to the query, the number
$n(T_i, \bar{R})$ of documents that contain the term but are not
relevant to the query, and so forth.

The odds that a future document will turn out to be
relevant to the query, given that it contains the term,
can be estimated crudely as

$$O(R \mid T_i) = \frac{n(T_i, R)}{n(T_i, \bar{R})}$$

To refine this estimate, a Bayesian trick adapted from
Good (1965) is useful. One simply adds the two figures
used in expressing the prior odds of relevance (i.e. $n(R)$
and $n(\bar{R})$) into the numerator and denominator
respectively, with an arbitrary weight $\beta$, as follows:

$$O(R \mid T_i) = \frac{n(T_i, R) + \beta\, n(R)}{n(T_i, \bar{R}) + \beta\, n(\bar{R})}$$

The smaller the value assigned to $\beta$, the more
influence the sample evidence will have in moving the
estimate away from the prior odds. The adjustment causes
large samples to have a greater effect than small, as seems
natural. It also forestalls certain absurdities implicit in the
unadjusted formula, for instance the possibility of
calculating infinite odds estimates.

Suppose now that a query contains $Q$ distinct terms
$T_1, \ldots, T_M, T_{M+1}, \ldots, T_Q$, numbered in such a way that
the first $M$ of them are match terms that occur in the
document against which the query is to be compared. The
fundamental equation can be applied by taking $N$ to be $Q$,
and interpreting the retrieval clues $A_1, \ldots, A_N$ as the
presence or absence of particular query terms in the document.
The pertinent data can be organized as shown in Table II.

Thus we are led to postulate, for query-specific
learning data, a retrieval equation of form

Equation (4):

$$\log O(R \mid A_1, \ldots, A_Q) = c_0 + c_1 f(Q) \left[ \Theta_1 + \Theta_2 - \Theta_3 \right]$$

where

$$\Theta_1 = \sum_{m=1}^{M} \log \frac{n(T_m, R) + \beta\, n(R)}{n(T_m, \bar{R}) + \beta\, n(\bar{R})}$$

$$\Theta_2 = \sum_{m=M+1}^{Q} \log \frac{n(\bar{T}_m, R) + \beta\, n(R)}{n(\bar{T}_m, \bar{R}) + \beta\, n(\bar{R})}$$

$$\Theta_3 = (Q - 1) \log \frac{n(R)}{n(\bar{R})}$$

and where $f(Q)$ is some restraining function intended as
before to subdue the inflationary effect of the Assumption
of Linked Dependence. The coefficients $c_0$, $c_1$ are found
by a logistic regression over a sample of query-document
pairs involving many queries.
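
As a rough illustration of the smoothed odds estimate and of the
three quantities that enter Eq. (4) (the sum over match terms, the
sum over absent query terms, and the prior-odds correction), the
following Python sketch may help. It is an assumption-laden
illustration rather than the authors' implementation: the function
names and the dictionary layout are hypothetical, natural logarithms
are assumed, and the count for an absent term is taken as the
sample total minus the count for the present term.

```python
import math


def smoothed_log_odds(n_clue_rel, n_clue_nonrel, n_rel, n_nonrel, beta):
    """log of (n(clue,R) + beta*n(R)) / (n(clue,notR) + beta*n(notR))."""
    return math.log((n_clue_rel + beta * n_rel) /
                    (n_clue_nonrel + beta * n_nonrel))


def theta_sum(term_counts, present, n_rel, n_nonrel, beta):
    """Match-term sum + absent-term sum - (Q - 1) * log prior odds.

    term_counts: {term: (n(T,R), n(T,notR))} from the query's own
                 learning sample (hypothetical layout).
    present:     set of query terms that occur in the document.
    """
    matched = absent = 0.0
    for term, (with_rel, with_nonrel) in term_counts.items():
        if term in present:
            matched += smoothed_log_odds(with_rel, with_nonrel,
                                         n_rel, n_nonrel, beta)
        else:
            # The clue is the term's absence, so count the sample
            # documents that lack the term.
            absent += smoothed_log_odds(n_rel - with_rel,
                                        n_nonrel - with_nonrel,
                                        n_rel, n_nonrel, beta)
    prior = (len(term_counts) - 1) * math.log(n_rel / n_nonrel)
    return matched + absent - prior
```

With no term-specific evidence at all (both counts zero), the
smoothed estimate falls back to the prior odds, which is the
behaviour the adjustment is meant to guarantee.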


Combining  Query-Specific  and  Query-Nonspecific 
Data 

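If a query-specific set of relevance judgements is
available for the query to be processed, a larger
nonspecific set may well be available at the same time. If
so the theory developed in the foregoing section can be
applied in conjunction with the earlier theory to capture
the benefits of both kinds of learning sets. The retrieval
equation will then contain variables not only of the kind
occurring in Eq. (4) but also of the Eq. (2) kind.

TABLE II: Reinterpretation of the Components of Eq. (1) for Routing Retrieval

$$
\begin{aligned}
\log O(R) &= \log \frac{n(R)}{n(\bar{R})} \\
\log O(R \mid A_1) - \log O(R) &= \log \frac{n(T_1, R) + \beta\, n(R)}{n(T_1, \bar{R}) + \beta\, n(\bar{R})} - \log \frac{n(R)}{n(\bar{R})} \\
&\;\;\vdots \\
\log O(R \mid A_M) - \log O(R) &= \log \frac{n(T_M, R) + \beta\, n(R)}{n(T_M, \bar{R}) + \beta\, n(\bar{R})} - \log \frac{n(R)}{n(\bar{R})} \\
\log O(R \mid A_{M+1}) - \log O(R) &= \log \frac{n(\bar{T}_{M+1}, R) + \beta\, n(R)}{n(\bar{T}_{M+1}, \bar{R}) + \beta\, n(\bar{R})} - \log \frac{n(R)}{n(\bar{R})} \\
&\;\;\vdots \\
\log O(R \mid A_Q) - \log O(R) &= \log \frac{n(\bar{T}_Q, R) + \beta\, n(R)}{n(\bar{T}_Q, \bar{R}) + \beta\, n(\bar{R})} - \log \frac{n(R)}{n(\bar{R})} \\[6pt]
\log O(R \mid A_1, \ldots, A_Q) &= \sum_{m=1}^{M} \log \frac{n(T_m, R) + \beta\, n(R)}{n(T_m, \bar{R}) + \beta\, n(\bar{R})}
 + \sum_{m=M+1}^{Q} \log \frac{n(\bar{T}_m, R) + \beta\, n(R)}{n(\bar{T}_m, \bar{R}) + \beta\, n(\bar{R})}
 - (Q - 1) \log \frac{n(R)}{n(\bar{R})}
\end{aligned}
$$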

It is convenient to formulate this equation in such a
way that it contains as one of its terms the entire ranking
expression developed earlier for the nonspecific learning
data. For the TIPSTER data the combined equation takes
the form:

Equation (5):

$$\log O(R \mid A_1, \ldots, A_M, A'_1, \ldots, A'_Q) = 0.688\, \Phi + 0.344 \left[ \Theta_1 + \Theta_2 - \Theta_3 \right] + 0.0623$$

where $\Phi$ is the entire right side of Eq. (3) and $\Theta_1$, $\Theta_2$, $\Theta_3$
are as defined in Eq. (4). This form for the equation is
computationally convenient if Eq. (3) is to be used as a
preliminary screening rule to eliminate unpromising
documents, with Eq. (5) in its entirety applied only to rank
those that survive the screening.

Eq.  (5)  was  used  to  produce  the  Trec-2  routing  run 
'Brkly5'.  Its  coefficients  were  determined  by  a  logistic 
regression  constrained  in  such  a  way  as  to  make  the 
query-specific  variables  contribute  about  twice  as  heavily 
as  the  nonspecific,  when  contribution  is  measured  by  stan- 
dardized coefficient  size.  This  emphasis  was  largely  arbi- 
trary; finding  the  optimal  balance  between  the  query- 
specific  and  the  general  contributions  remains  a  topic  for 
future research. A value of $20/n(R)$ was used for $\beta$. This
choice  too  was  arbitrary,  and  it  would  be  interesting  to  try 
to  optimize  it  experimentally  for  some  typical  collection 
(trying  out,  perhaps,  numbers  larger  than  20,  divided  by 
the  total  number  of  documents  in  the  query's  learning  set). 
No  restraining  function  f{Q)  was  used  in  the  final  form  of 
Eq.  (5)  because  none  that  were  tried  out  produced  any  dis- 
cernible improvement  in  fit  or  retrieval  effectiveness  in 
this  context. 
Calibration of Probability Estimates

The  most  important  role  of  the  relevance  probability 
estimates  calculated  by  a  probabilistic  IR  system  is  to  rank 
the  output  documents  in  as  effective  a  search  order  as  pos- 
sible. For  this  ranking  function  it  is  only  the  relative  sizes 
of  the  probability  estimates  that  matter,  not  their  absolute 
magnitudes.  However,  it  is  also  desirable  that  the  absolute 
sizes  of  these  estimates  be  at  least  somewhat  realistic.  If 
they  are,  they  can  be  displayed  to  provide  guidance  to  the 
users  in  their  decisions  as  to  when  to  stop  searching  down 
the  ranking.  This  capability  is  a  potentially  important  side 
benefit  of  the  probabilistic  approach. 

One  way  of  testing  the  realism  of  the  probability 
estimates  is  to  see  whether  they  are  'well-calibrated'. 
Good  calibration  means  that  when  a  large  number  of  prob- 
ability estimates  whose  magnitudes  happen  to  fall  in  a  cer- 
tain small  range  are  examined,  the  proportion  of  the  trials 
in  question  with  positive  outcomes  also  falls  in  or  close  to 
that  range.  To  test  the  calibration  of  the  probability  pre- 
dictions produced  by  the  Berkeley  approach,  the  50,000 
query-document  pairs  in  the  ad  hoc  entry  Brkly3  along 
with  their  accompanying  relevance  probability  estimates 
were  sorted  in  descending  order  of  magnitude  of  estimate. 
Pairs for which human judgements of relevance-relatedness
were unavailable were discarded; this left
22,352 sorted pairs for which both the system's probability
estimates of relevance and the 'correct' binary judgements
of relevance were available. This shorter list was
divided into blocks of 1,000 pairs each: the thousand
pairs with the highest probability estimates, the thousand
with the next highest, and so forth. Within each block the
'actual'  probability  was  estimated  as  the  proportion  of  the 
1,000  pairs  that  had  been  judged  to  be  relevance-related 
by  humans.  This  was  compared  against  the  mean  of  all 
the  system  probability  estimates  in  the  block.  For  a  well- 
calibrated  system  these  figures  should  be  approximately 
equal. 
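
The block-wise calibration check just described can be sketched in
a few lines of Python. This is an illustrative sketch only: the
function name and the input layout (a list of
`(estimated_probability, is_relevant)` tuples for judged pairs) are
assumptions, not part of the original system.

```python
def calibration_blocks(pairs, block_size=1000):
    """pairs: iterable of (estimated_probability, is_relevant) for
    judged query-document pairs (hypothetical layout).

    Returns one (mean_estimate, proportion_relevant, n_pairs) tuple
    per block, with blocks taken in descending order of estimate.
    """
    ranked = sorted(pairs, key=lambda p: p[0], reverse=True)
    rows = []
    for start in range(0, len(ranked), block_size):
        block = ranked[start:start + block_size]
        mean_estimate = sum(p for p, _ in block) / len(block)
        proportion_relevant = sum(1 for _, rel in block if rel) / len(block)
        rows.append((mean_estimate, proportion_relevant, len(block)))
    return rows
```

For a well-calibrated system the first two entries of each tuple
should be close to one another; the last block may hold fewer than
`block_size` pairs.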

The  results  of  the  comparison  are  displayed  in  Table 
III.  It  can  be  seen  that  the  system's  probability  predic- 
tions, while  not  wildly  inaccurate,  are  generally  somewhat 
higher  than  the  actual  proportions  of  relevant  pairs.  The 
same  phenomenon  of  mild  overestimation  was  observed 
when  the  runs  Brkly4  and  Brkly5  were  tested  for  well- 
calibratedness  in  a  similar  way. 

TABLE III: Calibration of Ad Hoc Relevance-Probability Estimates

Query-doc            Mean System            Proportion
Pair Ranks           Probability Estimate   Actually Relevant
1 to 1,000           0.66                   0.60
1,001 to 2,000       0.63                   0.47
2,001 to 3,000       0.61                   0.44
3,001 to 4,000       0.58                   0.41
4,001 to 5,000       0.55                   0.38
5,001 to 6,000       0.53                   0.34
6,001 to 7,000       0.50                   0.36
7,001 to 8,000       0.48                   0.36
8,001 to 9,000       0.46                   0.36
9,001 to 10,000      0.44                   0.38
10,001 to 11,000     0.42                   0.39
11,001 to 12,000     0.41                   0.36
12,001 to 13,000     0.39                   0.37
13,001 to 14,000     0.37                   0.36
14,001 to 15,000     0.36                   0.35
15,001 to 16,000     0.34                   0.31
16,001 to 17,000     0.32                   0.29
17,001 to 18,000     0.31                   0.28
18,001 to 19,000     0.29                   0.23
19,001 to 20,000     0.28                   0.22
20,001 to 21,000     0.25                   0.21
21,001 to 22,000     0.23                   0.23
22,001 to 22,352     0.18                   0.19

Since no systematic overestimation was observed
when the calibration of the formula was originally tested
against the learning data, it seems likely that the
overestimation seen in the table is due mainly to the shift
from learning data to test data. Naturally, predictive
formulae that have been fine-tuned to a certain set of learning
data will perform less well when applied to a new set of
data to which they have not been fine-tuned. If this is
indeed the root cause of the observed overestimation, it
could perhaps be compensated for (at least to an extent
sufficient for practical purposes) by the crude expedient of
lowering all predicted probabilities to, say, around four
fifths of their originally calculated values before
displaying them to the users.

Computational  Experience 

The statistical program packages used in the course
of the analysis included SAS, S, and BLSS. Of these,
SAS provides the most complete built-in diagnostics for
logistic regression. BLSS was found to be especially
convenient for interactive use in a UNIX environment, and
ended up being the most heavily used.

The prototype retrieval system itself was
implemented as a modification of the SMART system with
SMART's vector-similarity subroutines replaced by the
probabilistic computations of Eqs. (3) and (5). For the
runs Brkly3 and Brkly4, which used only Eq. (3), only
minimal modifications of SMART were needed, and at
run time the retrieval efficiency remained essentially the
same as for the unmodified SMART system. This
demonstrates that probabilistic retrieval need be no more
cumbersome computationally than the vector processing
alternatives. For Brkly5, which used Eq. (5), the modifications
were somewhat more extensive and retrieval took about
20% longer.

Retrieval  Effectiveness 

The Berkeley system achieved an average precision
over all documents (an '11-point average') of 32.7% for
the ad hoc retrieval run Brkly3, and 29.0% and 35.4%
respectively for the routing runs Brkly4 and Brkly5. The
distinct improvement in effectiveness of Brkly5 over
Brkly4 suggests that in routing retrieval the use of
frequency information about individual query stems is
worthwhile.

At  the  '0  recall  level'  a  precision  of  84.7%,  the 
highest  recorded  at  the  conference,  was  achieved  in  the  ad 
hoc  run.  The  high  effectiveness  of  the  Berkeley  system 
for  the  first  few  retrieved  documents  may  be  explainable 
in  terms  of  the  practice,  mentioned  earlier,  of  redoing  the 
regression  analysis  for  the  highest-ranked  500  documents 
for  each  query.  This  technique  ensures  an  especially  good 
regression  fit  for  the  query-document  pairs  that  are  espe- 
cially likely  to  be  relevant,  thus  emphasizing  good  perfor- 
mance near  the  top  of  the  ranking  where  it  is  most  impor- 
tant. 

The generally high retrieval effectiveness of the
Berkeley system should be interpreted in the light of the
fact that the system probably uses less evidence (that is,
fewer retrieval clues) than any of the other
high-performing TREC-2 systems. In fact, the only clues used
were the frequency characteristics of single stems (not
even phrases were included). What this suggests is that
the underlying probabilistic logic may have the capacity to
exploit exceptionally fully whatever clues may be
available.

Summary  and  Conclusions 

The  Berkeley  design  approach  is  based  on  a  proba- 
bilistic model  derived  from  the  linked  dependence 
assumption.  The  variables  of  the  probability-ranking 
retrieval  equation  and  their  coefficients  are  determined  by 
logistic  regression  on  a  judgement  sample.  Though  the 
model  is  hospitable  to  the  utilization  of  other  kinds  of  evi- 
dence, in  this  particular  investigation  the  only  variables 
used  were  optimally  relativized  frequencies  (ORF's)  of 
match  stems. 

The  approach  was  found  to  have  the  following 
advantages: 

1.  Experimental  Efficiency.  Since  the  numeric  coeffi- 
cients in  a  regression  equation  are  determined 
simultaneously  in  one  computation,  trial-and-error 
experimentation  involving  the  evaluation  of 
retrieval  output  to  optimize  parameters  is  largely 
avoidable. 

2.  Computational  Simplicity.  For  ad  hoc  retrieval  and 
routing  retrieval  that  does  not  involve  individual 
stem  statistics,  the  computational  simplicity  and 
efficiency  achieved  by  the  model  at  run  time  are 
comparable  to  that  of  simple  vector  processing 
retrieval  models.  For  routing  retrieval  that  exploits 
individual  stem  frequencies  the  programming  is 
somewhat  more  complicated  and  runs  slightly 
slower. 

3.  Effective  Retrieval.  The  level  of  retrieval  effective- 
ness as  measured  by  precision  and  recall  is  high  rel- 
ative to  the  simple  clue-types  used. 

4.  Potential  for  Well-Calibrated  Probability  Estimates. 
In-the-ballpark  estimates  of  document  relevance 
probabilities  suitable  for  output  display  would 
appear  to  be  within  reach. 
Abstract

In  this  paper,  we  describe  the  application  of  probabilis- 
tic models  for  indexing  and  retrieval  with  the  TREC-2 
collection.  This  database  consists  of  about  a  million 
documents  (2  gigabytes  of  data)  and  100  queries  (50 
routing  and  50  adhoc  topics).  For  document  indexing, 
we  use  a  description-oriented  approach  which  exploits 
relevance  feedback  data  in  order  to  produce  a  probabilis- 
tic indexing  with  single  terms  as  well  as  with  phrases. 
With  the  adhoc  queries,  we  present  a  new  query  term 
weighting  method  based  on  a  training  sample  of  other 
queries.  For  the  routing  queries,  the  RPI  model  is  ap- 
plied which  combines  probabilistic  indexing  with  query 
term  weighting  based  on  query-specific  feedback  data. 
The  experimental  results  of  our  approach  show  very 
good  performance  for  both  types  of  queries. 

1  Introduction 

The  good  TREC-1  results  of  our  group  described  in 
[Fuhr & Buckley 93] have confirmed the general concept
of  probabilistic  retrieval  as  a  learning  approach.  In  this 
paper,  we  describe  some  improvements  of  the  indexing 
and  retrieval  procedures.  For  that,  we  first  give  a  brief 
outline  of  the  document  indexing  procedure  which  is 
based  on  description-oriented  indexing  in  combination 
with  polynomial  regression.  Section  3  describes  query 
term  weighting  for  adhoc  queries,  where  we  have  devel- 
oped a  new  learning  method  based  on  a  training  sam- 
ple of  other  queries  and  corresponding  relevance  judge- 
ments. In  section  4,  the  construction  of  the  routing 
queries  is  presented,  which  is  based  on  the  probabilistic 
RPI  retrieval  model  for  query-specific  feedback  data.  In 
the  final  conclusions,  we  suggest  some  further  improve- 
ments of  our  method. 

2  Document  indexing 

The task of probabilistic document indexing can be
described as follows (see [Fuhr & Buckley 91] for more
details): Let $d_m$ denote a document, $t_i$ a term and $R$ the
fact that a query-document pair is judged relevant; then
$P(R \mid t_i, d_m)$ denotes the probability that document $d_m$
will be judged relevant w.r.t. an arbitrary query that
contains term $t_i$. Since these weights can hardly be
estimated directly, we use the description-oriented
indexing approach. Here term-document pairs $(t_i, d_m)$ are
mapped onto so-called relevance descriptions $x(t_i, d_m)$.
The elements $x_i$ of the relevance description contain
values of features of $t_i$, $d_m$ and their relationship, e.g.

tf = within-document frequency (wdf) of $t_i$,

logidf = log(inverse document frequency),

lognumterms = log(number of different terms in $d_m$),

imaxtf = 1/(maximum wdf of a term in $d_m$),

is_single = 1 if the term is a single word, 0 otherwise,

is_phrase = 1 if the term is a phrase, 0 otherwise.


(As  phrases,  we  considered  all  adjacent  non-stopwords 
that  occurred  at  least  25  times  in  the  D1+D2  (training) 
document  set.) 

Based on these relevance descriptions, we estimate
the probability $P(R \mid x(t_i, d_m))$ that an arbitrary
term-document pair having relevance description $x$ will be
involved in a relevant query-document relationship. This
probability is estimated by a so-called indexing
function $u(x)$. Different regression methods or probabilistic
classification algorithms can serve as indexing function.

For our retrieval runs submitted to TREC-2, we used
polynomial regression for developing an indexing
function of the form

$$u(x) = b \cdot v(x), \qquad (1)$$

where the components of $v(x)$ are products of elements
of $x$.

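The indexing function actually used has the form

$$
\begin{aligned}
u(x) = {} & b_0 + b_1 \cdot \mathit{is\_single} \cdot \mathit{tf} \cdot \mathit{logidf} \cdot \mathit{imaxtf}
          + b_2 \cdot \mathit{is\_single} \cdot \mathit{tf} \cdot \mathit{imaxtf} \\
        & + b_3 \cdot \mathit{is\_single} \cdot \mathit{logidf}
          + b_4 \cdot \mathit{lognumterms} \cdot \mathit{imaxtf} \\
        & + b_5 \cdot \mathit{is\_phrase} \cdot \mathit{tf} \cdot \mathit{logidf} \cdot \mathit{imaxtf}
          + b_6 \cdot \mathit{is\_phrase} \cdot \mathit{tf} \cdot \mathit{imaxtf} \\
        & + b_7 \cdot \mathit{is\_phrase} \cdot \mathit{logidf}.
\end{aligned}
$$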

The  coefficient  vector  b  is  computed  based  on  a  training 
sample  of  query-document-pairs  with  relevance  judge- 
ments. 

Since  polynomial  functions  may  yield  results  outside  the 
interval  [0, 1],  these  values  were  mapped  onto  the  corre- 
sponding boundaries  of  this  interval. 
For  each  phrase  occurring  in  a  document,  indexing 
weights  for  the  phrase  as  well  as  for  its  two  components 
(as  single  words)  were  computed. 
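
To make the shape of this indexing function concrete, here is a
minimal Python sketch of the eight-term polynomial together with the
clipping to [0, 1]. The function name, the dictionary keys and any
coefficient values passed in are assumptions for illustration; the
actual coefficients come from the regression on the training sample
and are not given in the text.

```python
def indexing_weight(features, b):
    """Evaluate the polynomial indexing function u(x), clipped to [0, 1].

    features: dict with keys tf, logidf, lognumterms, imaxtf,
              is_single, is_phrase (hypothetical layout).
    b:        sequence of eight coefficients b[0]..b[7].
    """
    tf, logidf = features["tf"], features["logidf"]
    imaxtf, lognumterms = features["imaxtf"], features["lognumterms"]
    single, phrase = features["is_single"], features["is_phrase"]

    # Components of v(x): products of elements of x.
    v = [
        1.0,
        single * tf * logidf * imaxtf,
        single * tf * imaxtf,
        single * logidf,
        lognumterms * imaxtf,
        phrase * tf * logidf * imaxtf,
        phrase * tf * imaxtf,
        phrase * logidf,
    ]
    u = sum(b_j * v_j for b_j, v_j in zip(b, v))
    # Map values outside [0, 1] onto the nearest boundary.
    return min(1.0, max(0.0, u))
```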

There  are  two  major  problems  with  this  approach  which 
we  are  investigating  currently: 

1.  Which  factors  should  be  used  for  defining  the  in- 
dexing functions?  We  are  developing  a  tool  that 
supports  a  statistical  analysis  of  single  factors  for 
this  purpose. 

2.  What  is  the  best  type  of  indexing  function?  Pre- 
vious experiments  have  suggested  that  regression 
methods  outperform  other  probabilistic  classifica- 
tion methods.  As  a  reasonable  alternative  to  poly- 
nomial regression,  logistic  regression  seems  to  offer 
some  advantages  (see  also  [Fuhr  &  Pfeifer  94]).  As 
a major benefit, logistic functions yield only values
between 0 and 1, so there is no problem with
outliers. We are performing experiments with logistic
regression and comparing the results to those based
on polynomial regression.

3  Query term weighting for ad-hoc queries

3.1    Theoretical  background 

The basis of our query term weighting scheme for ad-hoc
queries is the linear utility-theoretic retrieval function
described in [Wong & Yao 89]. Let $q_k^T$ denote the set
of terms occurring in the query, and $u_{im}$ the indexing
weight $u(x(t_i, d_m))$ (with $u_{im} = 0$ for terms $t_i$ not
occurring in $d_m$). If $c_{ik}$ gives the utility of term $t_i$ for the
actual query $q_k$, then the utility of document $d_m$ w.r.t.
query $q_k$ can be computed by the retrieval function

$$\varrho(q_k, d_m) = \sum_{t_i \in q_k^T} c_{ik} \cdot u_{im}. \qquad (2)$$

For the estimation of the utility weights $c_{ik}$, we applied
two different methods.

As a heuristic approach, we used tf weights (the
number of occurrences of the term $t_i$ in the query), which
had shown good results in the experiments described in
[Fuhr & Buckley 91].

As a second method, we applied linear regression to this
problem. Based on the concept of polynomial retrieval
functions as described in [Fuhr 89b], one can estimate
the probability of relevance of $q_k$ w.r.t. $d_m$ by the
formula

$$P(R \mid q_k, d_m) \approx \sum_{t_i \in q_k^T} c_{ik} \cdot u_{im}. \qquad (3)$$

If we had relevance feedback data for the specific query
(as is the case for the routing queries), this function
could be used directly for regression. For the ad-hoc
queries, however, we have only feedback information
about other queries. For this reason, we regard query
features instead of specific queries. This can be done
by considering for each query term the same features as
described before in the context of document indexing.
Assume that we have a set of features $\{f_0, f_1, \ldots, f_l\}$
and that $x_{ji}$ denotes the value of feature $f_j$ for term
$t_i$. Then we assume that query term weight $c_{ik}$ can be
estimated by linear regression according to the formula

$$c_{ik} = \frac{1}{|q_k^T|} \sum_{j=0}^{l} a_j x_{ji}. \qquad (4)$$

Here the factor $1/|q_k^T|$ serves for the purpose of
normalization across different queries, since queries with a larger
number of terms tend to yield higher retrieval status
values with formula 2. The factors $a_j$ are the coefficients
that are to be derived by means of regression. Now we
have the problem that regression cannot be applied to
eqn 4, since we do not observe $c_{ik}$ values directly.
Instead, we observe relevance judgements. This leads us
back to the polynomial retrieval function 3, where we
substitute eqn 4 for $c_{ik}$:

$$P(R \mid q_k, d_m) \approx \sum_{t_i \in q_k^T} \left( \frac{1}{|q_k^T|} \sum_{j=0}^{l} a_j x_{ji} \right) u_{im} = \sum_{j=0}^{l} a_j y_j \qquad (5)$$

with

$$y_j = \sum_{t_i \in q_k^T} \frac{1}{|q_k^T|}\, x_{ji}\, u_{im}. \qquad (6)$$


Equation 5 shows that we can apply linear regression
of the form $P(R \mid q_k, d_m) \approx a \cdot y$ to a training sample
of query-document pairs with relevance judgements in
order to determine the coefficients $a_j$. The values $y_j$ can
be computed from the number of query terms, the values
of the query term features and the document indexing
weights.
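
A small Python sketch may clarify how the $y_j$ values are formed
for one query-document pair and how the coefficients could then be
fitted. It is a sketch under assumptions, not the authors' code: the
function names and data layout are hypothetical, relevance
judgements are taken as 0/1 targets, and an ordinary least-squares
fit via the normal equations stands in for whatever regression
software was actually used. The solver assumes the normal equations
are non-singular.

```python
def y_vector(query_terms, doc_weights):
    """y_j = (1/|q|) * sum over query terms t_i of x_ji * u_im.

    query_terms: {term: [x_0, ..., x_l]} feature values per query term.
    doc_weights: {term: u_im}; terms not in the document count as 0.
    """
    n_features = len(next(iter(query_terms.values())))
    y = [0.0] * n_features
    for term, x in query_terms.items():
        u = doc_weights.get(term, 0.0)
        for j, x_j in enumerate(x):
            y[j] += x_j * u / len(query_terms)
    return y


def fit_least_squares(ys, targets):
    """Solve the normal equations (Y^T Y) a = Y^T r by Gauss-Jordan
    elimination; ys holds one y vector per training pair."""
    k = len(ys[0])
    # Augmented matrix [Y^T Y | Y^T r].
    m = [[sum(y[i] * y[j] for y in ys) for j in range(k)]
         + [sum(y[i] * r for y, r in zip(ys, targets))]
         for i in range(k)]
    for col in range(k):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, k), key=lambda row: abs(m[row][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(k):
            if row != col:
                factor = m[row][col] / m[col][col]
                m[row] = [a - factor * b for a, b in zip(m[row], m[col])]
    return [m[i][k] / m[i][i] for i in range(k)]
```

Each training pair contributes one `y_vector` result and one
relevance judgement; the fitted coefficients then define the query
term weights through eqn 4.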

For the experiments, the following parameters were
considered as query term features:

$x_0$ = 1 (constant)
$x_1$ = tf (within-query frequency)
$x_2$ = log tf
$x_3$ = tf · idf
$x_4$ = is_phrase
$x_5$ = in_title (=1 if term occurs in query title, =0 otherwise)

For most of our experiments, we only used the
parameter vector $x = (x_0, \ldots, x_4)^T$. The full vector is denoted
as $x'$. Below, we call this query term weighting method
reg.

This method is compared with the standard SMART
weighting schemes:

nnn: $c_{ik}$ = tf

ntc: $c_{ik}$ = tf · idf

lnc: $c_{ik}$ = (1 + log tf)

ltc: $c_{ik}$ = (1 + log tf) · idf
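
For reference, the four SMART formulas listed above can be written
as a single Python function. This is only a sketch of the formulas
as printed: the function name is hypothetical, the natural logarithm
is an assumption (the text does not state the base), and any length
normalization that SMART applies beyond these formulas is not
modelled.

```python
import math


def query_term_weight(scheme, tf, idf):
    """Query term weight c_ik for a term with within-query frequency
    tf >= 1 and inverse document frequency idf."""
    if scheme == "nnn":
        return tf
    if scheme == "ntc":
        return tf * idf
    if scheme == "lnc":
        return 1 + math.log(tf)
    if scheme == "ltc":
        return (1 + math.log(tf)) * idf
    raise ValueError(f"unknown scheme: {scheme}")
```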

3.2  Experiments 

In  order  to  have  three  different  samples  for  learning 
and/or  testing  purposes,  we  used  the  following  com- 
binations of  query  sets  and  document  sets  as  samples: 
Q3/D12  was  used  as  training  sample  for  the  reg  method, 
and  both  Q1/D12  and  Q2/D3  were  used  for  testing.  As 
evaluation measure, we consider the 11-point average
of precision (i.e., the average of the precision values at
0.0, 0.1, ..., 1.0 recall).
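
The 11-point average can be sketched in Python as follows. The text
does not say how precision at a given recall level is obtained, so
the interpolation rule used here (the maximum precision at any rank
whose recall is at least the given level) is an assumption, as are
the function name and the argument layout.

```python
def eleven_point_average(ranking, relevant):
    """Average of interpolated precision at recall 0.0, 0.1, ..., 1.0.

    ranking:  document ids in ranked order.
    relevant: set of ids judged relevant for the query.
    """
    points = []  # (recall, precision) after each relevant document
    hits = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))

    total = 0.0
    for step in range(11):
        level = step / 10
        # Assumed interpolation: best precision at recall >= level.
        candidates = [p for r, p in points if r >= level - 1e-12]
        total += max(candidates) if candidates else 0.0
    return total / 11
```

Averaging this value over all queries of a sample gives a figure of
the kind reported in the tables of this section.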


              sample
QTW     Q1/D12    Q2/D3
nnn     0.2303*
ntc     0.2754*
lnc     0.2291    0.2601
ltc     0.2826    0.2783
reg     0.2698    0.2678

* Only one value survives for this row; the column it belongs to is not recoverable.

Table 1: Global results for single words

First, we considered single words only. Table 1 shows
the results of the different query term weighting (QTW)
methods. First, it should be noted that the ntc and ltc
methods perform better than nnn and lnc. This
finding is somewhat different from the results presented in
[Fuhr & Buckley 91], where the nnn weighting scheme
gave us better results than the ntc method for lsp
indexing. However, in the earlier experiments, we used
only fairly small databases, and the queries also were
much shorter than in the TREC collection. These facts
may account for the different results.


                                              test sample
run  learning sample    query features    Q1/D12    Q2/D3
1    every doc.         x                 0.2698    0.2678
2    every 100th doc.   x                 0.2700    0.2678
3    every 1000th doc.  x                 0.2662*
4    judged docs only   x                 0.2635    0.2677
5    every doc.         x'                0.2654*
6    every 100th doc.   x'                0.2677*

* Only one value survives for this row; the column it belongs to is not recoverable.

Table 2: Variations of the reg learning sample and the
query features


                                   run
factor          1        2        3        4        5        6
constant     -3.05    -2.96     1.04   -20.80     2.47    -1.71
tf           -7.54   -10.67    -2.40    14.58    -4.32    14.11
log tf       -9.01    -5.69    -5.08   -81.58    -1.48    -0.70
tf · idf      5.84     7.00     1.63     8.72     1.81     7.70
is_phrase    -8.06   -19.23    -2.73    19.81    -2.97    20.39
in_title                                         -1.70     3.44

Table 3: Coefficients of query regression


In a second series of experiments, we varied the
sample size and the set of features of the regression method
(table 2). Besides using every document from the
learning sample, we only considered every 100th and every
1000th document from the database, as well as only
those documents for which there were explicit relevance
judgements available. As the results show almost no
differences, it seems to be sufficient to use only a small
portion of the database as training sample in order to
save computation time. The additional consideration
of the occurrence of a term in the query title also did
not affect the results. So query titles seem not to be
very significant. It is an open question whether or not
other parts of the queries are more significant, so that
their consideration as an additional feature would affect
retrieval quality.

The coefficients computed by the regression process for
the second series of experiments are shown in table 3.
It is obvious that the coefficients depend heavily on the
choice of the training sample, so it is quite surprising
that retrieval quality is not affected by this factor. The
only coefficient which does not change its sign through
all the runs is the one for the tf · idf factor. This seems
to confirm the power of this factor. The other factors
can be regarded as being only minor modifications of
the tf · idf query term weight.

Overall, it must be noted that the regression method
does not yield an improvement over the ntc and ltc
methods. This seems to be surprising, since the
regression is based on the same factors which also go into the
ntc and ltc formulas. However, a possible explanation
could be the fact that the regression method tries to
minimize the quadratic error for all the documents in
the learning sample, but our evaluation measure
considers at most the top ranking 1000 documents for each
query; so regression might perform well for most of the
documents from the database, but not for the top of
the ranking list. There is some indication for this
explanation, since regression always yields slightly better
results at the high recall end.


α       result
0.00    0.3199
0.10    0.3707
0.15    0.3734
0.20    0.3700
0.25    0.3656
0.30    0.3610
0.50    0.3451
1.00    0.3147

Table 4: Effect of downweighting of phrases (sample
Q2/D12)

As described before, in our indexing process, we consider
phrases in addition to single words. This leads to
the problem that when a phrase occurs in a document,
we index the phrase in addition to the two single words
forming the phrase. As a heuristic method for overcoming
this problem, we introduced a factor for downweighting
query term weights for phrases. That is, the actual
query term weight of a phrase is $c'_{ik} = \alpha c_{ik}$, where $c_{ik}$
is the result of the regression process. In order to derive
a value for $\alpha$, we performed a number of test runs
with varying values (see table 4). Weighting factors
between 0.1 and 0.3 gave the best results. For
the official runs, we chose $\alpha = 0.15$.
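The downweighting step can be illustrated with a minimal sketch. The function name, the example weights and the rule "a term containing a space is a phrase" are assumptions made for illustration only; they are not taken from the system described here.

```python
def downweight_phrases(query_weights, is_phrase, alpha=0.15):
    """Scale the query term weight of every phrase by alpha.

    Single words keep the weight produced by the regression step.
    """
    return {
        term: (alpha * weight if is_phrase(term) else weight)
        for term, weight in query_weights.items()
    }


# Hypothetical regression weights c_ik for one query. Treating any term
# that contains a space as a phrase is an illustrative convention.
weights = {"information": 0.8, "retrieval": 0.6, "information retrieval": 1.0}
adjusted = downweight_phrases(weights, lambda term: " " in term)
```

With alpha = 1 the phrase is counted fully, and with alpha = 0 only single words contribute, which corresponds to the two extremes in table 4.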


                      sample
QTW      α       Q1/D12     Q2/D3
lie     0.15     0.3192     0.3131
ltc     0.2      0.3220     0.3056
reg     0.15     0.3080     0.3062

Table 5: Results for single words and phrases

In table 5, this method is compared with the ltc formula,
where we also chose a weighting factor for phrases
which gave the best results. One can see that with the
sample Q2/D3, the differences between the methods are
smaller than on sample Q1/D12, but ltc still seems to
perform slightly better.

Finally, we investigated another method for coping with
phrases. For that, let us assume that we have binary
query weights only. Now as an example, the single words
$t_1$ and $t_2$ form a phrase $t_3$. For a query with phrase $t_3$
(and thus also with $t_1$ and $t_2$), a document $d_m$ containing
the phrase would yield $u_{1m} + u_{2m} + u_{3m}$ as value of the
retrieval function, where the weights $u_{im}$ are computed
by the lsp method described before. In order to avoid
the effect of counting the single words in addition to
the phrase, we modified the original phrase weight as
follows:

$$u'_{3m} = u_{3m} - u_{1m} - u_{2m}$$

and stored this value as phrase weight. Queries with
the single words $t_1$ or $t_2$ are not affected by this
modification. For the query with phrase $t_3$, however, the
retrieval function now would yield the value
$u_{1m} + u_{2m} + u'_{3m} = u_{3m}$, which is what we would like to get.


QTW      α       result
reg     0.00     0.2724
reg     1.00     0.2596
ntc     0.00     0.2754
ntc     0.15     0.3110
ntc     1.00     0.2524

Table 6: Results for the subtraction method (sample Q1/D12)

Table 6 shows the corresponding results ($\alpha = 0$ means
that single words only are considered). In contrast to
what we expected, we do not get an improvement over
single words only when phrases are considered fully.
The result for the ntc method shows that phrases
should still be downweighted. Possibly, there may be an
improvement with this method if we used binary
query term weights, but it is clear that other query term
weighting methods mostly give better results.
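The arithmetic behind the subtraction method can be checked with a small worked example. This is only a sketch under the binary-query-weight assumption stated above; the term names and the indexing weights are hypothetical.

```python
def stored_phrase_weight(u_phrase, u_word1, u_word2):
    """Phrase weight stored in the index: u'_3m = u_3m - u_1m - u_2m."""
    return u_phrase - u_word1 - u_word2


def retrieval_value(query_terms, doc_weights):
    """With binary query weights, sum the stored weights of matching terms."""
    return sum(doc_weights.get(term, 0.0) for term in query_terms)


# Hypothetical indexing weights of a document d_m for t1, t2 and phrase t3.
u1, u2, u3 = 0.2, 0.3, 0.7
doc = {"t1": u1, "t2": u2, "t3": stored_phrase_weight(u3, u1, u2)}

phrase_query = ["t1", "t2", "t3"]  # a phrase query also contains its words
word_query = ["t1"]                # single-word queries are unaffected
```

For the phrase query the three stored weights add up to the original phrase weight, so the single words are no longer counted a second time.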

3.3    Official  runs 

As document indexing method, we applied the
description-oriented approach as described in section 2.
In order to estimate the coefficients of the indexing
function, we used the training sample Q12/D12, i.e. the
query sets Q1 and Q2 in combination with the
documents from D1 and D2.

Two runs with different query term weights were
submitted. Run dortL2 is based on the nnn method, i.e. tf
weights. Run dortQ2 uses reg query term weights. For
performing the regression, we used the query sets Q1
and Q2 and a sample of 400,000 documents from D1.
Table 7 shows the results for the two runs (numbers in
parentheses denote figures close to the best/worst
results). As expected, dortQ2 yields better results than
dortL2. The recall-precision curves (see figure 1) show
that there is an improvement throughout the whole
recall range. For precision average and precision at 1000
documents retrieved, run dortQ2 performs very well,
while precision at 100 documents retrieved is less good.
This confirms our interpretation from above, saying that
regression optimizes the overall performance, but not
necessarily the retrieval quality when only the top ranking
documents are considered. With regard to the moderate
results for the reg query term weighting method,
the good performance of dortQ2 obviously stems from
the quality of our document indexing method.

Figure 1: Recall-precision curves of ad-hoc runs (curves: dortL2, dortQ2; x-axis: Recall)


run                                   dortL2       dortQ2
query term weighting                  nnn          reg

average precision:
  Prec. Avg.                          0.3151       0.3340

query-wise comparison with median:
  Prec. Avg.                          37:2         45:4
  Prec. @ 100 docs                    35:11        34:7
  Prec. @ 1000 docs                   37:10        45:2

Best/worst results:
  Prec. Avg.                          3/0          3(1)/0
  Prec. @ 100 docs                    3(2)/1       4(1)/0(2)
  Prec. @ 1000 docs                   6(1)/0       9(1)/0

dortL2 vs. dortQ2:
  Prec. Avg.                          21:29
  Prec. @ 100 docs                    22:24
  Prec. @ 1000 docs                   17:29

Table 7: Results for adhoc queries


4    Query term weighting for routing queries

4.1    Theoretical  background 

For the routing queries, the retrieval-with-probabilistic-indexing
(RPI) model described in [Fuhr 89a] was applied.
The corresponding retrieval function is based on
the following parameters:

$u_{im}$    indexing weight of term $t_i$ in document $d_m$,
$D_k^R$    set of documents judged relevant for query $q_k$,
$p_{ik}$    expectation of the indexing weight of term $t_i$ in $D_k^R$,
$D_k^N$    set of documents judged nonrelevant for query $q_k$,
$r_{ik}$    expectation of the indexing weight of $t_i$ in $D_k^N$.

The parameters $p_{ik}$ and $r_{ik}$ can be estimated based on
relevance feedback data as follows:

$$p_{ik} = \frac{1}{|D_k^R|} \sum_{d_j \in D_k^R} u_{ij}, \qquad r_{ik} = \frac{1}{|D_k^N|} \sum_{d_j \in D_k^N} u_{ij}$$

Then the query term weight is computed by the formula

$$c_{ik} = \frac{p_{ik}(1 - r_{ik})}{r_{ik}(1 - p_{ik})} - 1$$

and the RPI retrieval function yields

$$\varrho(q_k, d_m) = \sum_{t_i \in q_k^T \cap d_m^T} \log(c_{ik} u_{im} + 1). \qquad (7)$$

Figure 2: Recall-precision curves of routing runs
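The parameter estimation and the retrieval function can be sketched as follows. This is an illustration only: it assumes that the query term weight has the form c_ik = p_ik (1 - r_ik) / (r_ik (1 - p_ik)) - 1, it represents a document as a dictionary from terms to indexing weights, and it applies no smoothing, so a term with r_ik = 0 or p_ik = 1 would raise a division error. All function names are hypothetical.

```python
import math


def mean_weight(term, docs):
    """Average indexing weight of `term` over a list of documents."""
    return sum(doc.get(term, 0.0) for doc in docs) / len(docs)


def rpi_query_weight(term, relevant, nonrelevant):
    """c_ik = p_ik (1 - r_ik) / (r_ik (1 - p_ik)) - 1 (assumed form)."""
    p = mean_weight(term, relevant)
    r = mean_weight(term, nonrelevant)
    return p * (1.0 - r) / (r * (1.0 - p)) - 1.0


def rpi_score(query_weights, doc):
    """Sum of log(c_ik * u_im + 1) over terms shared by query and document."""
    return sum(
        math.log(c * doc[term] + 1.0)
        for term, c in query_weights.items()
        if term in doc
    )


# Hypothetical feedback data for one query and one term "t".
relevant = [{"t": 0.6}, {"t": 0.4}]   # p_ik = 0.5
nonrelevant = [{"t": 0.1}, {}]        # r_ik = 0.05
c = rpi_query_weight("t", relevant, nonrelevant)
score = rpi_score({"t": c}, {"t": 0.5})
```

In this example the weight is 0.5 · 0.95 / (0.05 · 0.5) - 1 = 18, and a document with indexing weight 0.5 for the term receives the score log(18 · 0.5 + 1) = log 10.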


4.2  Experiments 

In principle, the RPI formula can be applied with
or without query expansion. For our experiments in
TREC-1, we did not use any query expansion. The
final results showed that this was reasonable, mainly
because of the small amount of relevance feedback
data available. In contrast, for TREC-2 there were
about 2000 relevance judgements per query, so there was
clearly enough training data for applying query
expansion methods.

As basic criterion for selecting the expansion terms, we
considered the number of relevant documents in which a
term occurs, which gave us a ranking of candidates;
document indexing weights were considered for tie-breaking.
Then we varied the number of terms which are added
to the original query.
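The selection criterion can be sketched as below. The text does not say exactly how indexing weights enter the tie-breaking, so summing a term's indexing weights over the relevant documents is an assumption, as are the function name and the example data.

```python
from collections import defaultdict


def select_expansion_terms(query_terms, relevant_docs, n_terms):
    """Rank candidate terms by the number of relevant documents they occur in.

    Ties are broken by the summed indexing weight over the relevant
    documents (an assumed reading of the tie-breaking rule), then
    alphabetically to keep the result deterministic.
    """
    doc_freq = defaultdict(int)
    weight_sum = defaultdict(float)
    for doc in relevant_docs:
        for term, weight in doc.items():
            doc_freq[term] += 1
            weight_sum[term] += weight
    candidates = [t for t in doc_freq if t not in query_terms]
    candidates.sort(key=lambda t: (-doc_freq[t], -weight_sum[t], t))
    return candidates[:n_terms]


# Hypothetical relevant documents (term -> indexing weight).
relevant_docs = [
    {"oil": 0.5, "spill": 0.4, "tanker": 0.3},
    {"oil": 0.6, "tanker": 0.2, "coast": 0.5},
    {"spill": 0.3, "coast": 0.1},
]
expansion = select_expansion_terms({"oil"}, relevant_docs, n_terms=2)
```

Here all three candidates occur in two relevant documents, so the summed weights decide the order: spill (0.7), coast (0.6), tanker (0.5).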


expansion     result
    0         0.2909
   10         0.3047
   30         0.3035
   50         0.3002
  100         0.2832

Table 8: Effect of number of expansion terms

In a first series of experiments, we considered single
words only. We used Q2/D1 (lsp document indexing)
as training sample and Q2/D2 (ltc indexing) as test
sample. As can be seen from table 8, query expansion
clearly improves retrieval quality, but only for a limited
number of expansion terms. For larger numbers, we get
worse results. This effect seems to be due to parameter
estimation problems.


                     expansion
phraseweight    single w.    phrases     result
    0.5             0           0        0.3476
    0.5            20           0        0.3713
    1.0            20           0        0.3730
    0.5            20          10        0.3728
    1.0            20          10        0.3605
    0.5            30          10        0.3729
    1.0            30          10        0.3626

Table 9: Query expansion with phrases


In a second series of experiments, we looked at the
combination of single words and phrases. These
experiments were performed as retrospective runs, with
Q2/D12 as training sample and Q2/D2 as test sample
(both with ltc document indexing). For the number of
expansion terms, we treated single words and phrases
separately. Furthermore, similar to the adhoc runs, we
used an additional factor for downweighting the query
term weights of phrases. The different parameter
combinations tested and the corresponding results are given
in table 9. Phrases as expansion terms gave
no improvement, so we decided to have only single words
as expansion terms (but the phrases from the original
query are still used for retrieval). Furthermore, the
retrieval quality reaches its optimum at about 20 terms.
4.3    Official runs

Two different runs were submitted for the routing
queries, both based on the RPI model.

Run dortP1 uses the same document indexing function
as for the adhoc queries. Query terms were weighted
according to the RPI formula. In addition, each query
was expanded by 20 single words. Phrases were not
downweighted.

Run dortV1 is based on ltc document indexing. Here
no query expansion took place.


run                                   dortV1        dortP1
document indexing                     ltc           lsp
query expansion                       none          20 terms

average precision:
  Prec. Avg.                          0.3516        0.3800

query-wise comparison with median:
  Prec. Avg.                          38:10         46:4
  Prec. @ 100 docs                    31:11         40:5
  Prec. @ 1000 docs                   32:9          37:7

Best/worst results:
  Prec. Avg.                          1/0           4(2)/0
  Prec. @ 100 docs                    3(3)/1(1)     7(5)/1(1)
  Prec. @ 1000 docs                   6(2)/0(1)     10(2)/0(1)

dortV1 vs. dortP1:
  Prec. Avg.                          10:39
  Prec. @ 100 docs                    9:27
  Prec. @ 1000 docs                   7:33

Table 10: Results for routing queries


Table 10 shows the results for the two runs. The
recall-precision curves are given in figure 2. Again, the results
confirm our expectations that LSP indexing and query
expansion yield better results.

5    Conclusions and outlook

The experiments described in this paper have shown
that probabilistic learning approaches can be applied
successfully to different types of indexing and retrieval.
For the ad-hoc queries, there still seems to be room for
further improvement in the low recall range. In order to
increase precision, a passage-wise comparison of query
and document text should be performed. For this
purpose, polynomial retrieval functions could be applied.
In the case of the routing queries, we first have to
investigate methods for parameter estimation in combination
with query expansion. However, with the large number
of feedback documents given for this task, other types
of retrieval models may be more suitable, e.g.
query-specific polynomial retrieval functions.
Finally, it should be emphasized that we still use rather
simple forms of text analysis. Since our methods are
flexible enough to work with more sophisticated analysis
procedures, this combination seems to be a promising
area of research.

A    Operational details of runs
A.1    Basic Algorithms

The algorithm A to find the coefficient vector a for the
ad-hoc query term weights can be given as follows:

Algorithm A

1  For each query document pair $(q_k, d_m) \in (Q_1 \cup Q_2) \times D_s$
   with $D_s$ being a sample from $(D_1 \cup D_2)$ do

   1.1  Determine the relevance value $r_{km}$ of
        the document $d_m$ with respect to the
        query $q_k$.

   1.2  For each term $t_i$ occurring in $q_k$ do

        1.2.1  Determine the feature vector $x_i$
               and the indexing weight $u_{im}$ of the
               term $t_i$ w.r.t. document $d_m$.

   1.3  For each feature j of the feature vectors
        x compute the value of $y_j$ looping over
        the terms of the query.

   1.4  Add vector x and relevance value $r_{km}$
        to the least squares matrix.

2  Solve the least squares matrix to find the
   coefficient vector a.

The algorithm B to find the coefficient vector b for the
document indexing is sketched here:

Algorithm B

1  Index $D_1 \cup D_2$ (the learning document set) and
   $Q_1 \cup Q_2$ (the learning query set).

2  For each document $d \in D_1 \cup D_2$

   2.1  For each $q \in Q_1 \cup Q_2$

        2.1.1  Determine the relevance value r of
               d to q.

        2.1.2  For each term t in common between
               $q^T$ (set of query terms) and
               $d^T$ (set of document terms)

               2.1.2.1  Find values of the elements
                        of the relevance description
                        involved in this run and add
                        values plus relevance information
                        to the least squares matrix
                        being constructed.

3  Solve the least squares matrix to find the
   coefficient vector b.
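Both algorithm A and algorithm B end by solving a least squares problem. The sketch below shows one way to do that: each training pair contributes a feature vector and a relevance value to an augmented matrix of normal equations, which is then solved by Gaussian elimination. The function name and the training data are hypothetical, and the original implementation may have accumulated and solved the matrix differently. A singular matrix (for example, linearly dependent features) is not handled and would raise a division error.

```python
def solve_least_squares(rows, targets):
    """Solve the normal equations (X^T X) a = X^T y for the coefficients a."""
    n = len(rows[0])
    # Accumulate the augmented "least squares matrix" [X^T X | X^T y],
    # one training pair (feature vector x, relevance value y) at a time.
    m = [[0.0] * (n + 1) for _ in range(n)]
    for x, y in zip(rows, targets):
        for i in range(n):
            for j in range(n):
                m[i][j] += x[i] * x[j]
            m[i][n] += x[i] * y
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


# Hypothetical training data: a constant feature plus one real feature,
# with target values that follow y = 1 + 2 * x exactly.
features = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
relevance = [1.0, 3.0, 5.0, 7.0]
coefficients = solve_least_squares(features, relevance)
```

Because the matrix is accumulated pair by pair, the full set of feature vectors never has to be held in memory, which matters when the sample contains several hundred thousand documents.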
The algorithm C to index a document set D can now
be given as:

Algorithm C

1  For each document $d \in D$

   1.1  For each term $t \in d^T$

        1.1.1  Find values of the relevance
               description $x(t,d)$ involved in run.

        1.1.2  Give t the weight $b \cdot x(t,d)$.

   1.2  Add d to the inverted file.


A.2    Ad-hoc runs

The algorithm D is used for indexing and retrieval for
the ad-hoc runs. Steps numbered with a trailing "A"
apply only to run dortQ2, steps with a trailing "B" only
to run dortL2.

Algorithm D

1   Run algorithm B to determine the coefficient
    vector b for document indexing.

1A  Run algorithm A to determine the coefficient
    vector a for query indexing.

2   Call algorithm C for document set $D_1 \cup D_2$.

3   For each query $q_k \in Q_3$ do

    3.1  For each term $t_i$ occurring in $q_k$ do

         3.1.1A  Determine the feature vector $x_{ik}$
                 and compute the query term
                 weight $c_{ik}$ by multiplying it with a.

         3.1.1B  Weight $t_i$ w.r.t. $q_k$ (test query
                 set) with tf weights (nnn variant).
                 Phrases were downweighted by
                 multiplying the weights with
                 $\alpha = 0.15$.

    3.2  Run an inner product inverted file
         similarity match of $q_k$ against the inverted
         file formed in step 2, retrieving the top
         1000.
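The inner product match against an inverted file in the last step can be sketched as follows. The data layout (a dictionary of posting lists) and the function names are assumptions made for illustration; the actual system is not described at this level of detail.

```python
import heapq
from collections import defaultdict


def build_inverted_file(docs):
    """Map each term to a list of (doc_id, indexing weight) postings."""
    inverted = defaultdict(list)
    for doc_id, weights in docs.items():
        for term, weight in weights.items():
            inverted[term].append((doc_id, weight))
    return inverted


def retrieve(query_weights, inverted, top_k=1000):
    """Rank documents by the inner product of query and document weights."""
    scores = defaultdict(float)
    for term, c in query_weights.items():
        # Only the postings of the query terms are touched.
        for doc_id, u in inverted.get(term, []):
            scores[doc_id] += c * u
    return heapq.nlargest(top_k, scores.items(), key=lambda item: item[1])


# Hypothetical indexed documents and query term weights.
docs = {"d1": {"a": 0.5, "b": 0.2}, "d2": {"a": 0.1}, "d3": {"c": 0.9}}
ranking = retrieve({"a": 2.0, "b": 1.0}, build_inverted_file(docs))
```

Document d3 shares no term with the query and therefore never enters the ranking; d1 scores 2.0 · 0.5 + 1.0 · 0.2 = 1.2 and d2 scores 0.2.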


A.3    Routing runs

Algorithm E is used for indexing and retrieval for the
routing runs. Steps numbered with a trailing "A" apply
only to run dortP1, steps with a trailing "B" only to run
dortV1.

Algorithm E

1A  Index query set $Q_2$ and document set $D_1 \cup D_2$
    with tf · idf weights.

1B  Index query set $Q_2$ and document set $D_1 \cup D_2$
    by calling algorithm C.

2   For each query $q \in Q_2$

    2.1  For each term $t \in q^T$ (set of query terms)

         2.1.1  Reweight term t using the RPI
                relevance weighting formula and the
                relevance information supplied.

3A  Index document set $D_3$ by calling algorithm C.

3B  Index document set $D_3$ with tf · idf weights.
    Note that the collection frequency information
    used was derived from occurrences in $D_1 \cup D_2$
    only (in actual routing the collection frequencies
    within $D_3$ would not be known).

4   Run the reweighted queries of $Q_2$ (step 2)
    against the inverted file (step 3), returning the
    top 1000 documents for each query.
1    Project Goals

The  ARPA  TIPSTER  project,  which  is  the  source  of  the  data  and  funding  for  TREC, 
has  involved  four  sites  in  the  area  of  text  retrieval  and  routing.  The  TIPSTER  project 
in  the  Information  Retrieval  Laboratory  of  the  Computer  Science  Department,  University 
of  Massachusetts,  Amherst  (which  includes  MCC  as  a  subcontractor),  has  focused  on  the 
following  goals: 

•  Improving  the  effectiveness  of  information  retrieval  techniques  for  large,  full-text 
databases, 

•  Improving the effectiveness of routing techniques appropriate for long-term
information needs, and

•  Demonstrating  the  effectiveness  of  these  retrieval  and  routing  techniques  for  Japanese 
full  text  databases  [4]. 

Our  general  approach  to  achieving  these  goals  has  been  to  use  improved  representations 
of  text  and  information  needs  in  the  framework  of  a  new  model  of  retrieval.  This  model 
uses  Bayesian  networks  to  describe  how  text  and  queries  should  be  used  to  identify  relevant 
documents  [6,  3,  7].  Retrieval  (and  routing)  is  viewed  as  a  probabilistic  inference  process 
which  compares  text  representations  based  on  different  forms  of  linguistic  and  statistical 
evidence  to  representations  of  information  needs  based  on  similar  evidence  from  natural 
language queries and user interaction. Learning techniques are used to modify the
initial queries both for short-term and long-term information needs (relevance feedback and
routing, respectively).

This  approach  (generally  known  as  the  inference  net  model  and  implemented  in  the 
INQUERY  system)  emphasizes  retrieval  based  on  combination  of  evidence.  Different  text 
representations  (such  as  words,  phrases,  paragraphs,  or  manually  assigned  keywords)  and 
different  versions  of  the  query  (such  as  natural  language  and  Boolean)  can  be  combined 
in a consistent probabilistic framework. This type of "data fusion" has been known to be
effective in the information retrieval context for a number of years, and was one of the
primary motivations for developing the inference net approach.

Another feature of the inference net approach is the ability to capture complex structure
in the network representing the information need (i.e. the query). A practical consequence
of this is that complex Boolean queries can be evaluated as easily as natural language queries
and produce ranked output. It is also possible to represent "rule-based" or "concept-based"
queries in the same probabilistic framework. This has led us to concentrate on automatic
analysis of queries and techniques for enhancing queries rather than on in-depth analysis
of the documents in the database. In general, it is more effective (as well as efficient) to
analyze short query texts than millions of document texts. The results of the query analysis
are represented in the INQUERY query language, which contains a number of operators,
such as #SUM, #AND, #OR, #NOT, #PHRASE, and #SYN. These operators implement
different methods of combining evidence and describing concepts.

Some of the specific research issues we are addressing are morphological analysis in
English and Japanese, word sense disambiguation in English, the use of phrases and other
syntactic structure in English and Japanese, the use of special purpose recognizers (for
example, company, country and people name recognizers) in representing documents and
queries, analyzing natural language queries to build structured representations of
information needs, learning techniques appropriate for routing and structured queries, techniques
for acquiring domain knowledge by corpus analysis, and probability estimation techniques
for indexing.

The first TREC evaluation and the two previous TIPSTER evaluations have made it
clear that a lot remains to be learned about retrieval in large, full-text databases based
on complex information needs. Issues such as phrases, relevance feedback, and probability
estimation have proven to be quite difficult in such environments. On the other hand, the
effectiveness levels achieved have been quite good. The experiments done in the TREC-2
evaluation, together with the 24 month TIPSTER evaluation which followed it, were
designed to improve our understanding about which IR techniques work and why.

2    System  Description 

The document retrieval and routing system that has been developed on the basis of the
inference net model is called INQUERY [2]. The main processes in INQUERY are document
indexing, query processing, query evaluation and relevance feedback.

In the document indexing process, documents are parsed and index terms representing
the content of documents are identified. INQUERY supports a variety of indexing
techniques including simple word-based indexing, indexing based on part-of-speech tagging and
phrase identification, and indexing by domain-dependent features such as company names,
dates, locations, etc. The last type of indexing is a first step towards integrating detection
and extraction systems.

In more detail, the document structure is used to identify which parts will be used for
indexing. The first step of this process is then to scan for word tokens. Most types of
words (including numbers) are indexed, although a stop word list is used to remove very
common words. Stopwords can be indexed, however, if they are capitalized (but not at
the start of sentences) or joined with other words (e.g. "the The-1 system"). Words are
then stemmed to conflate variants. Although the Porter stemmer was used for the TREC-2
experiments, we have developed a new stemming algorithm that has a number of advantages
for operational systems. A number of recognizers written in flex are then used to identify
objects such as company names and mark their presence in the document using "meta"
index terms. A company name such as IBM in the text, for example, will result in a meta
term #COMPANY being recorded at that position in the text. The use of these meta terms
extends the range of queries that can be specified. This completes the usual processing for
document text.

The document indexing process also involves building the compressed inverted files
that are necessary for efficient performance with very large databases. Since positional
information is stored, overhead rates are typically about 40% of the original database size.

Query processing involves a series of steps to identify the important concepts
and structure describing a user's information need. INQUERY is unique in that it can
represent and use complex structured descriptions in a probabilistic framework. Many of
the steps in query processing are the same as those done in document indexing. In addition,
a part-of-speech tagger is used to identify candidate search phrases. Domain-dependent
features are recognized and meta-terms inserted into the query representation. The relative
importance of query concepts is also estimated, and relationships between concepts are
suggested based on simple grammar rules. An evaluation of some of the query processing
techniques is presented in [1].

INQUERY also has the capability of expanding the query using relationships between
concepts found either by using manually specified domain knowledge in the form of a simple
thesaurus or by corpus analysis. The WORDFINDER system is a version of INQUERY
that retrieves concepts that are related to the query. WORDFINDER is constructed by
identifying noun groups in the text and representing them by the words that are closely
associated with them (i.e. occur in the same text windows). Concept "documents" are then
stored in INQUERY. This technique of query expansion was not tested in TREC-2.

The  query  evaluation  process  uses  the  inverted  files  and  the  query  represented  as  an 
inference  net  to  produce  a  document  ranking.  The  evaluation  involves  probabilistic  inference 
based  on  the  operators  defined  in  the  INQUERY  language.  These  operators  define  new 
concepts  and  how  to  calculate  the  belief  in  those  concepts  using  linguistic  and  statistical 
evidence.  We  are  constantly  experimenting  with  and  refining  these  operators  (for  example, 
the  operator  defining  a  phrase-based  concept)  in  order  to  improve  retrieval  performance. 

The relevance feedback process uses information from user evaluations of retrieved
documents to modify the original query in detection or routing environments. The INQUERY
system, because it can represent structured queries, supports a wide range of learning
techniques for query modification [5]. In general, new words and phrases are identified in the
sample of relevant documents. These are added to the original query and all the terms
in the query are then reweighted. With the amount of relevance information available in
TIPSTER, relatively simple automatic techniques appear to produce good levels of
effectiveness. We are also investigating the effect of using more limited information and more
complex learning techniques, such as neural networks.
3    Query Processing


In order to clarify the query processing done for the TREC and TIPSTER experiments
with INQUERY, the following sections give more detailed descriptions.

There are two main kinds of query styles: a natural language query and a keyword or
key concept query. For example, the <desc> and <narr> fields of a TIPSTER query
represent natural language queries of varying levels of abstraction. The <con>, <title>
and <fac> fields represent key concepts in the query. The main difference between the
two types of processing is that the key concept query has more controlled information.
The phrasing and emphasis are already given and do not have to be conjectured from the
language structure. It is valuable to discover how to treat both styles of query, because a
good user interface will make it easy for a user to input both styles. For example, a user
may enter a prose query and then highlight the important words and phrases in the query
in some convenient manner. These highlighted words would then be treated as key concepts
in the query processing.

3.1    Prose  query  processing 

Natural language query fields are tagged for syntactic category by a part-of-speech (POS)
tagger. Currently we use the tagger developed by Ken Church. We have developed our
own POS tagger, and we expect to begin using it in the fall of 1993. There are some
pre-tagging and post-tagging "housekeeping" operations, such as removing parentheses. (The
current version of INQUERY does not permit parentheses except as part of an operator,
and we do not yet make any inferences from the presence of parentheses during the text
processing.) Additionally, we change operator phrases to single words in order to simplify
later processing. An example of this simplification is replacing the phrase "in order to" with
the infinitive particle "to" or replacing "with respect to" with the word "regarding". The goal of
this replacement is to remove phrases which resemble noun phrases syntactically but which
are really syntactic operators (e.g., phrasal prepositions) with no substantive content. At
this stage, stop phrases are also removed.

3.1.1     Noun  and  adjective  phrase  capture:  orthographic  and  syntactic  clues. 

When the text is tagged and the potentially irrelevant material has been removed,
syntactically-based noun group capture is performed. Certain kinds of noun phrase patterns
are enfolded in a #PHRASE operator:

1.  A noun phrase which contains more than one modifying adjective and noun is enclosed
in a #PHRASE operator;

2.  A head noun with no premodifiers and followed by a prepositional phrase is enclosed
in a #PHRASE operator with the head noun of the prepositional phrase;
3.1.2     Constraint capture

All text in the query is searched for constraint expressions. Among these expressions are
the words "company" and "not U.S.", or a restriction in the nationality section of the <fac> field
to U.S. or other nationality. A restriction to U.S. nationality as the area of interest is
implemented by penalizing documents for references to foreign countries. A restriction to
other nationalities is implemented by repeating that country as a term. This asymmetry
depends on the fact that the document collection is drawn solely from U.S. sources, and
therefore the U.S., as the default area of interest, is rarely referred to unless the government
or foreign policy implementation is under discussion.

There  is  some  recognition  of  simple  time  expressions,  such  as  since  1984  which  are 
expanded  to  the  set  of  years  which  might  be  intended  by  the  phrase  in  question. 
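A minimal sketch of such an expansion is given below. The text does not say which upper bound is used for the set of years, so the `last_year` parameter is an assumption, as are the function name and the restriction to the single pattern "since YYYY".

```python
import re


def expand_since(expression, last_year):
    """Expand 'since YYYY' into the list of years up to last_year.

    Returns an empty list for expressions this sketch does not recognize.
    """
    match = re.fullmatch(r"since\s+(\d{4})", expression.strip().lower())
    if match is None:
        return []
    first_year = int(match.group(1))
    return [str(year) for year in range(first_year, last_year + 1)]


# Hypothetical upper bound chosen for the example.
years = expand_since("since 1984", last_year=1988)
```

The resulting years could then be added to the query as ordinary terms, so that a document mentioning any of them matches the time constraint.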

Countries are recognized as such and are handled so that expressions like South Africa
are phrased as #1(south africa) even when they appear in the middle of a larger group
of capitalized words. In addition, proper names such as country names are moved out of
the scope of #PHRASE operators, since it generally increases the effectiveness of a #PHRASE to
reduce the number of words in it. Nationality constraints can better be maintained within
the scope of the larger and more tolerant #SUM operator. For example, the phrase

"import ban on South African diamonds"

becomes, by stages,

#PHRASE(import ban on #SYN(#1(south african) #1(south africa)) diamonds)

and finally

#SUM(#SYN(#1(south african) #1(south africa))
#PHRASE(import ban on diamonds)).
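The final form of this rewriting can be produced by simple string construction, as in the sketch below. It only assembles the operator expression shown above; recognizing the country name and its variants is assumed to have happened already, and the function names are hypothetical.

```python
def exact_phrase(expression):
    """Wrap a multi-word expression in the #1 operator."""
    return "#1({})".format(expression)


def lift_country(phrase_words, country_variants):
    """Build #SUM(#SYN(...) #PHRASE(...)) with the country moved out of
    the scope of the #PHRASE operator."""
    syn = "#SYN({})".format(" ".join(exact_phrase(v) for v in country_variants))
    phrase = "#PHRASE({})".format(" ".join(phrase_words))
    return "#SUM({} {})".format(syn, phrase)


query = lift_country(
    ["import", "ban", "on", "diamonds"],
    ["south african", "south africa"],
)
```

For the example phrase this yields `#SUM(#SYN(#1(south african) #1(south africa)) #PHRASE(import ban on diamonds))`.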

3.2    Key  concept  query  processing 

Key concept query processing is different from prose query processing since the concept
separation provided by the user can presumably be trusted. Instead of using a
part-of-speech tagger, we rely on comma delimitation of concepts, and #PHRASE the words found
between each pair of delimiters.

Additionally, if any constraints were found anywhere else in the query, e.g., a mention of
the word "company" or an exclusionary geographical constraint (e.g., "not USA" or "only USA"),
the query will be modified according to these constraints. For example,

only USA  →  #NOT(#FOREIGNCOUNTRY)

and

not USA  →  #NOT(#USA).

If the word company is found in a query, then a second copy of the key concepts (the
<con> field) is produced where each item in the field appears in an unordered window
operator with the special concept #COMPANY. For example, if the phrase South Africa
appears as a key concept (and company appears somewhere in the query), then the
preprocessor would produce the term #UW50(#COMPANY #1(south africa)) which would
match any document which had a company name within fifty words of South Africa.
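
A rough sketch of the key concept processing described in this section follows. The function name and the tokenization are assumptions, and the geographical constraints are left out for brevity; only the comma splitting, the #PHRASE wrapping, and the #UW50/#COMPANY copy come from the text.

```python
import re

def key_concept_terms(con_field, full_query):
    """Turn a comma-delimited <con> field into INQUERY terms.

    Hypothetical helper: each concept is wrapped in #PHRASE, and if the
    word 'company' occurs anywhere in the query, a second copy of each
    concept is emitted inside #UW50 together with #COMPANY.
    """
    concepts = [" ".join(part.lower().split()) for part in con_field.split(",")]
    concepts = [concept for concept in concepts if concept]
    terms = [f"#PHRASE({concept})" for concept in concepts]
    if "company" in re.findall(r"[a-z]+", full_query.lower()):
        terms += [f"#UW50(#COMPANY #1({concept}))" for concept in concepts]
    return terms

terms = key_concept_terms(
    "South Africa, sanctions",
    "Which company trades with South Africa?",
)
print(terms)
```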
4  The TREC Experiments


Four experiments were submitted to the TREC evaluation, two "ad-hoc" and two "routing".
In  these  experiments,  we  emphasized  automatic  query  processing  and  automatic  feedback 
algorithms  for  routing.  The  following  is  a  summary: 

•  AdHoc: topics 101-150 against TIPSTER volumes 1 and 2.

INQ001  Created automatically from TIPSTER topics. Contains phrases. Details of
query processing used are described below.

INQ002  INQ001 queries, modified manually. Modifications restricted to eliminating
words and phrases, and adding paragraph-level operators around existing words
and phrases. This was done somewhat differently than for last year's TREC
conference, as discussed below.

•  Routing: topics 51-100 against TIPSTER volume 3.

INQ003  Created automatically from TIPSTER topics and relevance judgements
from Volumes 1 and 2. Baseline queries (from a previous TIPSTER evaluation)
were modified by reweighting and adding single-word terms. The term
weighting and selection function used was df.idf, as described in [5]. Only the
top 120 relevant documents found by INQUERY were used for feedback, and 30
terms were added to each query.

INQ004  Formed by combining (using the #SUM operator) INQ001 queries and
INQRYP queries (used in TIPSTER 18 month evaluation). The INQRYP queries
were produced automatically and then modified manually. Modifications
restricted to eliminating words and phrases, and adding paragraph-level operators
around existing words and phrases.


Query Type                  Average Precision

             5 Docs         30 Docs        100 Docs       11-Pt Avg
INQ001       .62            .57            .49            .36
INQ002       .60 (-2.6%)    .59 (+3.5%)    .51 (+4.1%)    .36 (0%)

Table 1: Results for Adhoc queries

Table 1 gives the results for the adhoc queries. These show that there is little difference
in effectiveness between the automatically processed queries and the semi-automatically
processed queries. The query processing for the automatically processed queries has been
significantly improved as described in the previous section, but there is another effect.
Compared to the manual query run in the last TREC conference, paragraph-level concepts
were formed in a much more mechanistic way and were constrained by the language of the
description and the narrative. In the previous conference, the only constraint was the
vocabulary used in the queries, and the user's "world knowledge" was used to group concepts.
That less constrained approach resulted in considerably better retrieval performance.
Additional experiments using manually edited queries are discussed in the next section.


Query Type                  Average Precision

             5 Docs         30 Docs        100 Docs       11-Pt Avg
INQ003       .64            .56            .45            .35
INQ004       .67 (+3.7%)    .58 (+2.7%)    .45 (0%)       .36 (+2.4%)

Table 2: Results for Routing queries

The routing results show that some improvement is obtained by combining the manual
queries with the queries that were automatically modified using relevance feedback
techniques. The difference in performance between the two types of queries is considerably less
than last year, however. Our own experiments have also shown that no additional gains in
performance were obtained by using more than the top 150 documents from the INQUERY
output. This is a significant result from a practical viewpoint, since in an operational
environment we will not want to rely on having output from other systems or need thousands
of relevance judgements before performance improves.

5     Other  Experiments 

In  the  TIPSTER  24  month  evaluation,  which  took  place  soon  after  the  TREC-2  evaluation, 
we did a number of experiments that complement those done in TREC. In particular, we
evaluated  paragraph-based  retrieval,  expansion  using  an  automatically  generated  thesaurus, 
feedback  techniques  that  use  phrases,  and  Japanese  indexing  techniques.  In  this  section,  we 
report  some  of  the  most  interesting  results.  The  precision  figures  given  here  are  calculated 
using  the  TREC-2  relevance  judgements,  rather  than  the  TIPSTER  judgements. 

The first two experiments were with adhoc queries. INQ041 (the numbers are consistent
with those used in TIPSTER and other publications) is a run that used a different manually
modified version of INQ001. That is, the manual modifications were the same as those done
in the first TIPSTER and TREC evaluations, rather than the more restricted modifications
done for INQ002. INQ042 is a run that combines INQ041 with INQ001.

Query Type                  Average Precision

             5 Docs         30 Docs        100 Docs       11-Pt Avg
INQ041       .68            .60            .50            .36
INQ042       .65 (-4.6%)    .61 (+1.7%)    .51 (+2.0%)    .38 (+5.6%)

Table 3: Results for TIPSTER adhoc queries

These results show that the manually modified queries can achieve significantly better
precision at low recall levels. For example, at the 5 document cutoff level, the average
precision for INQ041 is 9.7% higher than INQ001. The overall average is the same, however.
This is a much smaller difference than was seen in the first TREC and TIPSTER evaluations
of the INQUERY system and it indicates that the automatic query processing has improved
considerably.

The combination search (INQ042) is slightly worse than INQ041 at the 5 document
cutoff level, but overall is better than either the automatic or manual queries on their own.
An adhoc search that incorporates automatic paragraph-level matching was also tested in
TIPSTER and this resulted in a further 5% improvement.

INQ023 and INQ024 are routing query sets that were created automatically using
relevance judgements from volumes 1 and 2. In addition to the single-word terms added in
INQ003, 10 phrase-level concepts and 20 paragraph-level concepts were added to the query.
A phrase-level concept is a #UW5 two-word pattern that occurs frequently in the relevant
documents, and a paragraph-level concept is a #UW50 two-word pattern. The #UWn
operator looks for co-occurrence in any order in a text window of size n. The difference
between INQ023 and INQ024 is that INQ023 contains the original query terms in addition
to terms extracted from relevant documents, whereas INQ024 contains only terms from
relevant documents.
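
The behaviour of the #UWn operator can be illustrated with a small sketch. It assumes that "window of size n" means n consecutive tokens and that matching is on exact tokens; the function name is hypothetical and this is not the INQUERY code, which also has to turn matches into belief values.

```python
def unordered_window_match(tokens, terms, n):
    """Return True if all terms occur, in any order, inside some window
    of n consecutive tokens (an assumed reading of #UWn)."""
    for start in range(len(tokens)):
        window = set(tokens[start:start + n])
        if all(term in window for term in terms):
            return True
    return False

tokens = "the company said south africa would lift the ban".split()
print(unordered_window_match(tokens, ["ban", "lift"], 5))      # phrase-level, like #UW5
print(unordered_window_match(tokens, ["company", "ban"], 5))   # too far apart for #UW5
print(unordered_window_match(tokens, ["company", "ban"], 50))  # paragraph-level, like #UW50
```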

Query Type                  Average Precision

             5 Docs         30 Docs        100 Docs       11-Pt Avg
INQ023       .67            .60            .47            .38
INQ024       .68 (+1.5%)    .59 (-1.7%)    .46 (-2.2%)    .39 (+2.6%)

Table 4: Results for TIPSTER routing queries

These results show that there is little difference between using the original query or just
the relevant documents. This is probably due to the large number of relevance judgements
available in this routing experiment. In a relevance feedback situation, where there are far
fewer relevant documents, the original query is very important. It is clear that the addition
of phrase and paragraph-level structure to the routing has improved performance. The
average precision for INQ023 is 8.6% higher than INQ003. Combining these new runs with
manually modified routing queries produced further improvements.

6  Summary 

The  TREC-2  runs,  both  in  the  adhoc  and  routing  categories,  provided  further  evidence  that 
manually  generated  queries  are  not,  in  general,  superior  to  automatically  processed  natural 
language  queries.  In  the  case  of  routing,  in  fact,  the  manual  queries  are  significantly  less 
effective.  They  do,  however,  improve  the  effectiveness  of  retrieval  when  used  in  combination 
with  the  automatic  queries.  This  combination  of  query  types  has  been  a  theme  of  the 
research at the University of Massachusetts and has been established as effective in a number
of  experiments. 

The additional TIPSTER runs showed that learning structure in the form of phrases
and paragraph-level co-occurrences is effective for routing. They also showed that learning
techniques significantly improve performance (the best routing runs were more than 20%
higher in terms of average precision than the best queries that were not modified using
relevance judgements). It is becoming apparent that techniques that may not work well in
relevance feedback situations with few identified relevant documents may be very effective
in routing, where there are many more relevant documents identified. We are currently doing
experiments with different forms of weighting, including the use of identified non-relevant
documents.

With regard to improving the performance of adhoc queries, we are continuing to carry
out experiments with different ways of estimating the probabilities (or tf.idf weights) needed
for the inference net, and with different forms of paragraph-level matching. Finally, as
mentioned earlier, we have seen some significant improvements using automatic query expansion
based on corpus analysis.
1.  Overview of DR-LINK's Approach

The  theoretical  goal  underlying  the  DR-LINK  System  is  to  represent  and  match  documents  and  queries  at  the  various 
linguistic  levels  at  which  human  language  conveys  meaning.  Accordingly,  we  have  developed  a  modular  system 
which processes and represents text at the lexical, syntactic, semantic, and discourse levels of language. In concert,
these levels of processing permit DR-LINK to achieve a level of intelligent retrieval beyond more traditional
approaches.  In  addition,  the  rich  annotations  to  text  produced  by  DR-LINK  are  replete  with  much  of  the  semantics 
necessary  for  document  extraction. 

The  system  was  planned  and  developed  in  a  modular  fashion  and  functional  modularity  has  been  achieved,  while  a 
full integration of these multiple levels of linguistic processing is within reach. As currently configured, DR-LINK
performs  a  staged  processing  of  documents,  with  each  module  adding  a  meaningful  annotation  to  the  text.  For 
matching,  a  Topic  Statement  undergoes  analogous  processing  to  determine  its  relevancy  requirements  for  documents 
at  each  stage.  Among  the  many  benefits  of  staged  processing  are:  improvements  and  changes  can  be  easily  made 
within  any  module;  the  contribution  of  the  various  stages  can  be  empirically  tested  by  simply  turning  them  on  or 
off;  modules  can  be  re-ordered  (as  was  done  within  the  last  six  months)  in  order  to  utilize  document  annotations  in 
various  ways,  and;  individual  modules  can  be  incorporated  in  other  evolving  systems. 

The purpose of each of the processing modules will be briefly introduced here (also see Figure 1) in the order in
which the system is currently run, with fuller explanations provided in the section below: 1) the Text Structurer
labels clauses or sentences with a text-component tag which provides a means for responding to the discourse-level
Topic Statement requirements of time, source, intentionality, and state of completion; 2) the Subject Field Coder
provides a subject-based, summary-level vector representation of the content of each text; 3) the Proper Noun
Interpreter and 4) the Complex Nominal Phraser provide precise levels of content representation in the form of
concepts and relations, as well as controlled expansion of group nouns and content-bearing nominal phrases; 5) the
Relation-Concept Detector produces concept-relation-concept triples with a range of semantic relations expressed via
various syntactic classes, e.g. verbs, nominalized verbs, complex nominals, and proper nouns; 6) the Conceptual
Graph Generator combines the triples to form a CG and adds Roget's International Thesaurus (RIT) codes to concept
nodes; and 7) the Conceptual Graph Matcher determines the degree of overlap between a query graph and graphs of
those documents which surpass a statistically predetermined criterion of likelihood of relevance based on ranking by
the integrated processing of the first four system modules.

2.  Detailed  System  Description 

In  the  following  system  description,  emphasis  is  placed  on  work  accomplished  within  the  last  year,  plus  a  basic 
overview description of each module. The more rudimentary processing details of each module plus fuller description
of earlier development are available in the TREC-1 Proceedings (Harman, 1993).

2.A.  Text Structurer

Since  human  interpretation  of  text  is  influenced  by  expectations  regarding  the  text  to  be  read,  discourse  level  analysis 
is  required  for  a  system  to  approximate  the  same  level  of  meaningful  representation  and  matching.  DR-LINK's  Text 
Structurer  is  based  on  discourse  linguistic  theory  which  suggests  that  texts  of  a  particular  type  have  a  predictable 
text-level structure which is used by both producers and readers of that text-type as an indication of how and where
certain  information  endemic  to  that  text-type  will  be  conveyed.  We  have  implemented  a  Text  Structurer  for  the 
newspaper  text-type,  which  produces  an  annotated  version  of  a  news  article  in  which  each  clause  or  sentence  is  tagged 
for  the  specific  slot  it  instantiates  in  the  news-text  model,  an  extension  of  van  Dijk's  earlier  model  (1988).  The 
structure  annotations  are  used  to  respond  more  precisely  to  information  needs  expressed  in  Topic  Statements,  where 
some  aspects  of  relevancy  can  only  be  met  by  understanding  a  Topic  Statement's  discourse  requirements.  For 
example, Topic Statement 75 states that:

Document  will  identify  an  instance  in  which  automation  has  clearly  paid  off,  or  conversely, 
has  failed. 

which contains the implicit discourse requirement that relevant instances should occur in the CONSEQUENCE
component  of  a  news  article.  DR-LINK  extracts  this  requirement  from  the  Topic  Statement  and  will  only  assign  a 
similarity  value  for  the  discourse-level  of  relevance  to  those  documents  in  which  the  sought  information  occurs  in  a 
CONSEQUENCE  component. 

The current news-text model consists of thirty-eight recognizable components of information observed in a large
sample of training texts (e.g. MAIN EVENT, VERBAL REACTION, EVALUATION, FUTURE
CONSEQUENCE, PREVIOUS EVENT). The Text Structurer assigns these component labels to document clauses
or sentences on the basis of lexical clues learned from text, which now comprise a special lexicon. We considered
expanding the lexicon via available lexical resources such as Roget's International Thesaurus or WordNet, but our
analysis of these resources suggested that they do not capture the particularities of lexical usage in the sublanguage
of newspaper reporting.

The Text Structurer has recently been improved to assign structural tags at the clause level, a refinement which has
corrected most of the anomalies that were observed in earlier testings of the Text Structurer. For example, given the
new clause-level structuring, the following sentence is correctly interpreted as containing both future-oriented
information in the LEAD-FUTURE segment and some nested information regarding a past situation in the
LEAD-HISTORY segment.

<LEAD-FUT> South Korea's trade surplus, <LEAD-HIST> which more than doubled in 1987
to $6.55 billion, </LEAD-HIST> is expected to narrow this year to above $4 billion. </LEAD-FUT>

We have recently implemented new matching techniques which more fully realize the Text Structurer's potential
contribution to the system's performance. This was achieved as one outcome of a study which greatly increased our
understanding of how text structure requirements in Topic Statements should be used for matching documents to
Topic Statements. Analysis of relevant and non-relevant documents retrieved for a test sample of Topic Statements
indicated that most of the errors in the Text Structurer's matching were not serious errors, but only slight
mismatches in terms of the conceptual definitions of some of the text model's components. This suggested that our
model was overly specific for the task of responding to discourse aspects of information requirements, and that
matching Text Structure needs from a Topic Statement to structured documents called for a more generalized model.
That is, Topic Statement text-structure requirements are not expressed at the same level of specificity at which Text
Structure components are recognizable in documents.

Given this, we reduced the matching complexity via a function that maps the thirty-eight news-text components to
seven meta-components. These are: LEAD-MAIN, HISTORY, FUTURE, CONSEQUENCE, EVALUATION,
ONGOING, and OTHERS. The new approach allows the system to continue to impose the finer-level,
38-component structure on the newspaper articles themselves with excellent precision, but maps this fuller set of text
components to the seven meta-components at the matching stage, as the Topic Statements' text structure
requirements are coded at the meta-component level. Unofficial experimental results indicate that this new scheme
has significantly increased the Text Structurer's contribution to an improved level of precision in the retrieval of
relevant documents.
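
The mapping function described above might look roughly like the following sketch. The seven meta-component names come from the text, but the individual assignments in `FINE_TO_META` are purely illustrative, since the actual 38-to-7 table is not given here; defaulting unlisted components to OTHERS is likewise an assumption.

```python
META_COMPONENTS = (
    "LEAD-MAIN", "HISTORY", "FUTURE", "CONSEQUENCE",
    "EVALUATION", "ONGOING", "OTHERS",
)

# Illustrative assignments only; the real 38-to-7 table is not in the text.
FINE_TO_META = {
    "MAIN EVENT": "LEAD-MAIN",
    "PREVIOUS EVENT": "HISTORY",
    "EVALUATION": "EVALUATION",
}

def to_meta(component):
    """Map a fine-grained component label to a meta-component.

    Unlisted labels fall back to OTHERS (an assumption of this sketch).
    """
    return FINE_TO_META.get(component, "OTHERS")

def matches_requirement(document_tags, required_meta):
    """True if any tagged clause in the document maps to the meta-component
    that the Topic Statement requires."""
    return any(to_meta(tag) == required_meta for tag in document_tags)

print(matches_requirement(["MAIN EVENT", "PREVIOUS EVENT"], "HISTORY"))
```

Documents keep their fine-grained tags; only the comparison with the Topic Statement is done at the coarser level.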
While the Text Structurer module processes documents as described above, the analysis of Topic Statements for their
Text Structure requirements is done by the Natural Language Query Constructor (QC) which also analyzes the proper
noun and complex nominal requirements of Topic Statements. The QC, as well as the matching and ranking of
documents using these sources of linguistic information, is described below.

2.B.  Subject Field Coder

The Subject Field Coder (SFC), as reported at TREC-1, has been producing consistently reliable semantic vectors to
represent both documents and Topic Statements using the semantic codes assigned to word senses in a
machine-readable dictionary. Details of this process are reported in Liddy et al., 1993. Our more recent efforts
on  this  module  have  focused  on  multiple  ways  to  exploit  SFC-based  similarity  values  between  document  and  query 
vectors.  One  implementation  is  the  use  of  the  ranked  vector-similarity  values  for  predicting  a  cut-off  criterion  of 
potentially  relevant  documents  when  the  module  is  used  as  an  initial  filter.  This  is  to  replace  the  earlier  practice  used 
in  the  eighteenth  month  TIPSTER  testing,  where  documents  were  ranked  by  their  SFC-vector  similarity  to  a  query 
SFC-vector  and  the  top  two  thousand  documents  were  passed  to  the  CG  Matcher,  since  CG  matching  is  too 
computationally  expensive  to  handle  all  documents  in  the  collection.  To  report  the  SFC's  performance,  at  that  time 
we  reported  how  far  down  the  ranked  list  of  documents  the  system  would  need  to  process  documents  in  order  to  get 
all  the  judged  relevant  documents.  Although  the  results  were  highly  promising  (all  relevant  documents  were,  on 
average,  in  the  top  37%  of  the  ranked  list  based  on  SFC  similarity  values),  this  figure  varies  considerably  for 
individual  Topic  Statements.  Therefore,  we  needed  to  devise  a  method  for  predicting  a  priori  for  individual  Topic 
Statements,  the  cut-off  criterion  for  any  desired  level  of  recall.  We  first  developed  a  method  that  could  successfully 
predict  a  cut-off  criterion  based  on  just  SFC  similarity  values.  We  then  extended  the  algorithm  to  incorporate  the 
similarity  values  produced  when  proper  noun,  complex  nominal,  and  text  structure  requirements  are  considered  as 
well,  to  produce  an  integrated  ranking  based  on  these  varied  sources  of  linguistic  information. 

The SFC-based cut-off criterion uses a multiple regression formula which was developed on the odd-numbered Topic
Statements from 1 to 50 and a training corpus of Wall Street Journal articles. The regression formula takes into
account the distribution of similarity values for documents in response to a particular query by incorporating the
mean and standard deviation of the similarity value distribution, the similarity of the top-ranked document, and the
desired recall level. The cut-off criterion was tested on the held-out, twenty-five Topic Statements. The averaged
results, when a user is striving for 100% recall, showed that only 39.65% of the 173,255 documents would need to
be processed further. And this document set, in fact, contained 92% of the judged-relevant documents.
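
The shape of such a regression-based cut-off can be sketched as below. Only the four predictors (mean, standard deviation, top-ranked similarity, desired recall) come from the text; the linear form, the function names, and above all the coefficient values in `COEF` are placeholders chosen for illustration, not the fitted formula.

```python
from statistics import mean, pstdev

def predict_cutoff(similarities, recall, coef):
    """Predict a similarity cut-off from the distribution of SFC similarity
    values for one query (hypothetical linear form)."""
    features = {
        "intercept": 1.0,
        "mean": mean(similarities),
        "std": pstdev(similarities),
        "top": max(similarities),
        "recall": recall,
    }
    return sum(coef[name] * value for name, value in features.items())

def documents_to_process(similarities, cutoff):
    """Indices of documents whose similarity reaches the cut-off."""
    return [i for i, sim in enumerate(similarities) if sim >= cutoff]

# Placeholder coefficients, NOT the fitted values from the study.
COEF = {"intercept": 0.0, "mean": 1.0, "std": 0.5, "top": 0.0, "recall": -0.1}

sims = [0.9, 0.7, 0.4, 0.2, 0.1, 0.1]
cutoff = predict_cutoff(sims, recall=1.0, coef=COEF)
kept = documents_to_process(sims, cutoff)
print(round(cutoff, 3), kept)
```

Because the predictors are recomputed for each query, the cut-off adapts to how the similarity values are distributed for that particular Topic Statement.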

The advantage of the cut-off criterion is its sensitivity to the varied distributions of SFC similarity values for
individual Topic Statements, which appears to reflect how "appropriate" a Topic Statement is for a particular
database. For many queries, a relatively small portion of the database, when ranked by similarity to the Topic
Statement, will need to be further processed. For example, for Topic Statement forty-two, when the goal is 100%
recall, the regression formula predicts a cut-off criterion similarity value which requires that only 13% of the ranked
output be further processed, and the available relevance judgments show that this pool of documents contains 99% of
the documents judged relevant for that query.

2.C.  V-8  Matching 

Given the complete modularity of the first four modules in the system, for the twenty-four month TIPSTER testing,
we reordered two modules so that Text Structuring is done prior to Subject Field Coding. This allowed us to
implement and test a new version of matching which combines in a unique way the Text Structurer and the Subject
Field Coder. We refer to this version as the V-8 model, since eight SFC vectors are produced for each document, one
for each of the seven meta-categories, plus one for all of the categories combined. The V-8 model, therefore, provides
multiple SFC vectors for each document, thereby representing the distribution of SFCs over the various meta-text
components that occur in a news-text document. This means, in the V-8 matching, that if certain content areas of the
Topic Statement are required to occur in a document in one meta-text component, e.g. CONSEQUENCE, and other
content is required to occur in another meta-text component, e.g. FUTURE, this proportional division can be
matched against the V-8 vectors produced for each document at a fairly abstract, subject level. For the TIPSTER
twenty-four month evaluation, we have experimented with several formulas for combining the similarity values of
the multiple SFC vectors produced for each document, including both a Dempster-Shafer combination and a straight
averaging. Although official results are not yet available, our internal test results indicate that the combination of
Text Structuring and Subject Field Coding produces an improved ranking of documents, especially when using the
Dempster-Shafer method.
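
The two combination schemes can be contrasted with a short sketch. The text does not say how similarity values were turned into belief masses, so the Dempster-Shafer variant here makes a simplifying assumption: each per-component similarity in [0, 1] is treated as a simple support function for "relevant", for which Dempster's rule reduces to one minus the product of the complements. The use of cosine similarity and all function names are also assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two SFC vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def average_combination(similarities):
    """Straight averaging of the per-component similarity values."""
    return sum(similarities) / len(similarities)

def dempster_simple_support(similarities):
    """Dempster's rule for simple support functions that all back the same
    hypothesis ('relevant'): belief = 1 - product(1 - s_i)."""
    remaining_doubt = 1.0
    for sim in similarities:
        remaining_doubt *= (1.0 - sim)
    return 1.0 - remaining_doubt

component_sims = [0.5, 0.5]
print(average_combination(component_sims))
print(dempster_simple_support(component_sims))
```

Under this reading, agreement across several meta-components reinforces the combined score, whereas averaging leaves it unchanged.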

2.D.  Proper  Noun  Interpreter 

Our earlier work with the SFCoder suggested that the most important factor in improving the performance of this
upstream ranking module would be to integrate the general subject-level representation provided by SFCodes with a
level of text representation that enabled more refined discrimination. Analysis of earlier test results suggested that
proper noun (PN) matching that incorporated both particular proper nouns (e.g. Argentina, FAA) as well as
'category' level proper nouns (e.g. third-world country, government agency) would improve precision performance.
The Proper Noun Interpreter (Paik et al., 1993) that we developed provides: a canonical representation of each proper
noun; a classification of each proper noun into one of thirty-seven categories; and a means for expanding group
nouns into their constituent members (e.g. all the countries comprising the Third World). Recent work on our proper
noun algorithms, context-based rules, and knowledge bases has improved the module's ability to recognize and
categorize proper nouns to 93% correct categorization using 37 categories as tested on a sample set of 545 proper
nouns from newspaper text. The improved performance has a double impact on the system's retrieval performance, as
proper nouns contribute both to the downstream relation-concept representation used in CG matching as well as to
the upstream proper noun, complex nominal, and text structure ranking of documents in relation to individual
queries. Details of processing Topic Statements for their PN requirements and the use of this similarity value in
document ranking are described in the later section on the Query Constructor.

2.E.  Complex Nominal Phraser

A new level of natural language processing has been incorporated in the DR-LINK System with the implementation
of the Complex Nominal (CN) Phraser. The motivation behind this addition was our recognition that, whether in
addition to proper nouns or in their absence, most of the substantive content requirements of Topic
Statements are expressed in complex nominals (i.e. noun + noun, e.g. "debt reduction", "government assistance",
"health hazards"). Complex nominals provide a linguistic means for precise conceptual matching, as do proper
nouns. However, the conceptual content of complex nominals can be expressed in synonymous phrases, in a
different way than can the conceptual content of proper nouns, which are more particularized. Therefore, for complex
nominals, a controlled expansion step was incorporated in the CN matching process in order to accomplish the
desired goals of improved recall, as well as improved precision.

For input to the CN Phraser, the complex nominals in Topic Statements are recognizable as adjacent noun pairs or
non-predicating adjective + noun pairs in the output of the part-of-speech tagger. Having recognized all CNs, the
substitutable phrases for each complex nominal are found by computationally determining the overlap of
synonymous terms suggested by RIT and statistical corpus analysis. These processes serve to identify all second
order associations between each complex nominal constituent and terms in the database. Second order associations
exist between terms that are used interchangeably in certain contexts. The premise here is that if, for example, terms
a and b are both frequently premodified by the same set of terms in a corpus, it is highly likely that terms a and b are
substitutable for each other within these phrases. The use of both corpus and RIT information appears to limit the
over-generation that frequently results from automatic term expansion. Ongoing experiments on this new addition to
the system will help us further refine the process and will be reported more extensively in the near future.
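
The corpus side of this premise can be sketched as follows. The sketch assumes that (premodifier, head) pairs have already been extracted from tagged text, and it scores two head terms by the Jaccard overlap of their premodifier sets; the scoring measure, the toy pairs, and the function names are illustrative assumptions, and the intersection with RIT synonyms is not modeled.

```python
from collections import defaultdict

def premodifier_sets(pairs):
    """Collect, for each head term, the set of terms that premodify it."""
    modifiers = defaultdict(set)
    for modifier, head in pairs:
        modifiers[head].add(modifier)
    return modifiers

def shared_premodifier_score(modifiers, a, b):
    """Jaccard overlap of the premodifier sets of heads a and b
    (an assumed stand-in for the second order association measure)."""
    mods_a = modifiers.get(a, set())
    mods_b = modifiers.get(b, set())
    union = mods_a | mods_b
    return len(mods_a & mods_b) / len(union) if union else 0.0

# Toy (premodifier, head) pairs, invented for illustration.
pairs = [
    ("debt", "reduction"), ("debt", "relief"),
    ("tax", "reduction"), ("tax", "relief"),
    ("debt", "crisis"), ("flood", "relief"),
]
modifiers = premodifier_sets(pairs)
print(shared_premodifier_score(modifiers, "reduction", "relief"))
print(shared_premodifier_score(modifiers, "reduction", "crisis"))
```

Head terms whose score exceeds some threshold would be candidates for the same equivalence class, subject to the RIT check described above.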

The terms that exhibit second order associations are compiled into equivalence classes. These equivalence classes
provide substitutable synonymous phrases for Topic Statement complex nominals and are used by the matching
algorithms in the same manner that the original complex nominals are used. The complex nominals and their
substitutes are first used in the upstream matching of Topic Statements to documents as one contributing factor to
the integrated similarity value, to be further explained in the section on the Query Constructor.

In  addition,  each  complex  nominal  and  its  assigned  relation  provides  a  CRC  to  the  RCD  module  for  use  in  the  final 
round  of  matching.  For  that  module,  semantic  relations  between  the  constituent  nouns  of  each  complex  nominal  are 
assigned manually, using an ontology of forty-three relations. Some example complex nominals plus relation are:

[press]  <-  (SOURCE)  <-  [commentaries] 
[growth]  ->  (MEASURE)  ->  [rate] 
[electronic]  <-  (MEANS)  <-  [theft] 
[campaign]  <-  (USED_FOR)  <-  [finances] 

The development of the complex nominal CRC knowledge base was an intellectual effort for the twenty-four month
testing, but our current task is the full automation of the semantic relation assignment. Although difficult, our
experience  with  the  intellectual  process  has  encouraged  us  to  pursue  appropriate  NLP-based  machine-learning 
techniques  which  will  enable  the  system  to  automatically  recognize  and  code  semantic  relations  in  complex 
nominals. 

In CG matching, the existence of both case-frame relations and complex nominal relations makes it possible for the
system  to  detect  conceptual  similarity  even  if  expressed  in  different  grammatical  structures,  such  as  a  verb  + 
arguments  in  a  Topic  Statement  and  a  complex  nominal  in  a  document,  e.g.: 

"reduce  the  debt"  =  [reduc*]  ->  (OBJECT)  ->  [debt] 
"debt  reduction"  =  [debt]  <-  (OBJECT)  <-  [reduc*] 

To  achieve  the  fullest  exploitation  of  relational  information  despite  grammatical  realization,  a  further  step  was 
necessary  in  order  to  match  on  CRCs  produced  by  verb-based  analysis  and  CRCs  produced  by  complex  nominal 
analysis.  This  required  the  determination  of  the  degree  of  relation-similarity  across  the  two  relation  sets.  There  are 
approximately  sixty  relations  used  in  case  frames,  while  there  are  approximately  forty  relations  used  in  complex 
nominals.  A  relation-similarity  table  was  constructed  that  assigns  a  degree  of  similarity  between  twenty-eight  pairs 
across  the  two  grammatically-distinguished  sets,  and  a  degree  of  similarity  between  pairs  within  the  same  set.  The 
relation-similarity  table  is  used  in  the  final  CG  matching  to  allow  concepts  that  are  linked  by  a  relation  in  a 
document that is different from the relation that links the same two concepts in the Topic Statement, to still be
awarded  some  degree  of  similarity.  The  quality  and  appropriateness  of  the  similarity  table  will  be  determined  by  the 
results  of  the  twenty-four  month  testing  which  will  also  provide  empirical  evidence  of  the  Complex  Nominal 
Phraser's  impact  on  performance.  Sample  runs  have  indicated  that  the  inclusion  of  complex  nominals  has  a  strongly 
positive impact on our results in both of its incorporations in the system.

2.F. Natural Language Query Constructor

We have implemented a Natural Language Query Constructor (QC) for DR-LINK which takes as input a Topic
Statement  which  has  been  pre-processed  by  straight-forward  techniques,  such  as  part-of-speech  tagging  as  well  as 
SGML-tagging  of  the  meta-language  which  reflects  the  typical  request-presentation  language  used  in  Topic 
Statements  (e.g.  "A  relevant  document  will ..."  or  "To  be  relevant...").  The  QC  produces  a  query  which  reflects  the 
appropriate  logical  combinations  of  the  text  structure,  proper  noun,  and  complex  nominal  requirements  of  a  Topic 
Statement. The basis of the QC is a sublanguage grammar which is a generalization over the regularities exhibited in
the Topic, Description, and Narrative fields of the one hundred fifty TIPSTER Topic Statements. It should be noted
that the sublanguage grammar, with minor modifications, is capable of handling non-TIPSTER queries, so its
generalized utility is promising. Earlier work (Liddy et al., 1991) demonstrated that the sublanguage approach is an
effective and efficient approach to natural language processing tasks within a particular text-type, here Topic
Statements. 

For the twenty-four month runs, the QC sublanguage grammar detects the required logical combination of text
structure components, proper nouns, and complex nominals. These are the specific entities which we consider to be
particularly revealing indicators of relevant documents. In most cases, matching on these classes produces
high-precision ranked results, although there are some instances in which single common nouns may also be needed. After
analyzing the twenty-four month results, we will determine whether to expand the range of linguistic types which
can be used to instantiate the variables in the QC's logical assertions.

The QC sublanguage grammar relies on function words (e.g. conjunctions, prepositions, relative pronouns),
meta-level phrases (e.g. "such as", "examples of", "as well as"), and punctuation (e.g. commas, semi-colons) to recognize
and extract the relevancy requirements of Topic Statements. These linguistic features serve as clues to the
'organizing' structure of a Topic Statement and present each Topic Statement's unique thematic content in a
recognizable frame. The QC sublanguage grammar applies pattern-action rules which
reduce each sentence in a Topic Statement to a first order logic assertion, reflecting the boolean-like requirements
of Topic Statements, including NOT'd assertions. In addition, definite noun phrase anaphors are recognized and
resolved by sublanguage grammar processing rules.

2.G.  Integrated  Matcher 

Each  logical  assertion  produced  by  the  QC  for  a  Topic  Statement  is  evaluated  against  the  entries  in  the  document 
inverted  file  and  a  weight  is  assigned  to  each  segment  of  text  (either  a  clause  or  a  sentence)  which  has  any  similarity. 
The  weighting  scheme  we  are  currently  using  evolved  from  iterative  testing.  Each  segment  of  text  is  indexed  in  the 
inverted  file  with  a  text  structure  component  label  and  will  be  assigned  a  weight  if  it  contains  any  proper  nouns  or 
complex  nominals  that  match  the  Topic  Statement's  requirements.  The  following  weights  are  assigned: 

proper noun = 1.00
complex nominal = 1.00
proper noun category = 0.50

This means, for example, that if, in response to the following requirement from a Topic Statement:

A relevant document will provide data on Japanese laws, regulations, and/or practices
which  help  the  foreigner  understand  how  Japan  controls,  or  does  not  control,  stock- 
market  practices  which  could  be  labeled  as  insider  trading. 

a  document  text-segment  contains  'Japanese  law',  and  'stock-market  practice'  (or  one  of  its  synonymous  phrases),  and 
'insider  trading'  (or  one  of  its  synonymous  phrases),  that  segment  is  assigned  a  preliminary  value  of  3.00.  Depending 
on which field in the Topic Statement the assertion came from, and whether the document text-segment matches the
Topic Statement's Text Structure requirement, the preliminary value will be multiplied by one of the following
coefficients:

Topic field and required Text Structure component = 1.00
Desc, Narr, or Concept field and required Text Structure component = 0.75
Topic field and non-required Text Structure component = 0.50
Desc, Narr, or Concept field and non-required Text Structure component = 0.25

So  if  'Japanese  law'  and  'stock-market  practice'  and  'insider  trading'  were  conceptual  requirements  from  a  Topic  field 
assertion that also required them to occur in an EVALUATION or LEAD-MAIN text component, and they occurred
in a document text segment which has been tagged by the Text Structurer as EVALUATION, the value of 3 would
be multiplied by 1; whereas if that assertion came from the Description field in the Topic Statement and the three
required  phrases  occurred  in  a  document  text  segment  labelled  CONSEQUENCE  by  the  Text  Structurer,  the  value  of 
3  would  be  multiplied  by  .25. 

Since the QC interprets each sentence in the Topic, Description, Narrative, and Concept fields in a Topic Statement,
multiple, sometimes overlapping, sometimes repetitive assertions are produced for a single Topic Statement. In the
current implementation, each of these Topic Statement assertions is compared to the inverted document file, and the
highest  similarity  value  for  a  single  assertion  in  the  document  is  used  as  that  document's  integrated  similarity  value 
for  that  Topic  Statement. 
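
The arithmetic described above can be sketched in a few lines of Python. The weights and coefficients are the ones listed in the text; the function names, the dictionary layout, and the choice to treat all three matched phrases as complex nominals are assumptions made for the illustration.

```python
# Match weights as listed in the text.
MATCH_WEIGHTS = {
    "proper_noun": 1.00,
    "complex_nominal": 1.00,
    "proper_noun_category": 0.50,
}

# Keyed by (assertion came from the Topic field?, segment is a required component?).
COEFFICIENTS = {
    (True, True): 1.00,
    (False, True): 0.75,
    (True, False): 0.50,
    (False, False): 0.25,
}

def segment_score(match_types, from_topic_field, in_required_component):
    """Preliminary value (sum of match weights) times the field/structure coefficient."""
    preliminary = sum(MATCH_WEIGHTS[m] for m in match_types)
    return preliminary * COEFFICIENTS[(from_topic_field, in_required_component)]

def document_score(assertion_scores):
    """Highest single-assertion score found in the document (0.0 if there is none)."""
    return max(assertion_scores, default=0.0)

# 'Japanese law', 'stock-market practice', 'insider trading': three matches of weight 1.00
# (assumed here to be complex nominals).
matches = ["complex_nominal"] * 3
topic_eval = segment_score(matches, from_topic_field=True, in_required_component=True)
desc_conseq = segment_score(matches, from_topic_field=False, in_required_component=False)
print(topic_eval, desc_conseq, document_score([topic_eval, desc_conseq]))
```

The two calls reproduce the worked example: 3 multiplied by 1 for the Topic-field assertion in an EVALUATION segment, and 3 multiplied by .25 for the Description-field assertion in a CONSEQUENCE segment, with the larger value kept as the document's score.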

The  similarity  value  which  results  from  the  QC  module  matching  is  combined  with  the  SFC  similarity  value  of  the 
document,  and  an  integrated  similarity  score  for  each  document  is  produced.  This  similarity  value  can  be  used  in 
several ways. Firstly, the two similarity values can be used to provide a full ranking of all the documents which
takes into account the lexical, semantic and discourse sources of linguistic information in both documents and
queries.  Secondly,  it  can  serve  as  input  to  a  filter  which  uses  a  more  complex  version  of  the  original  cut-off  criterion 
to  determine  how  many  documents  should  be  further  processed  by  the  system's  final  modules. 

For  the  Integrated  Matcher  to  produce  a  combined  ranking,  each  document's  similarity  value  for  a  given  Topic 
Statement  can  be  thought  of  as  being  composed  of  two  elements.  One  element  is  the  SFC  similarity  value  and  one 
element  is  the  similarity  value  that  represents  the  combined  proper  noun,  complex  nominal,  and  text  structure 
similarities. Additionally, the system will have computed the regression formula, the mean, and standard deviation of
the distribution of the SFC similarity values for the individual Topic Statement. Using these statistical values, the
system produces the cut-off criterion value. Since we know from the eighteen-month results that 74% of the
relevant documents had what we refer to as a k-value (then PN value; now PN, CN, TS values) and the remaining
26% of the relevant documents had no k-value, we use this information to predict what proportion of the predicted
relevant documents should come from which segment of the ranked documents for full recall. The combined ranking
can  be  envisioned  as  consisting  of  four  segments,  as  shown  in  Figure  2. 


Group 1: Docs. having a k-value & an SFC value above the cut-off
  ---- cut-off criterion SFC similarity value ----
Group 2: Docs. having a k-value & an SFC value below the cut-off

Group 3: Docs. having no k-value & an SFC value above the cut-off
  ---- cut-off criterion SFC similarity value ----
Group 4: Docs. having no k-value & an SFC value below the cut-off

Fig. 2: Schematic of Segmented Ranks from SFC & Integrated Ranking (k-value)


Four  groups  are  required  to  reflect  the  two-way  distinction  mentioned  above.  The  first  distinction  is  between  those 
groups  which  have  a  k-value  and  which  should  contain  74%  of  the  relevant  documents  and  those  documents  without 
a  k-value,  which  should  contribute  26%  of  the  relevant  documents.  The  second  distinction  is  between  those 
documents  whose  SFC  similarity  value  is  above  the  predicted  cut-off  criterion  and  those  whose  SFC  similarity  value 
is  not. 

When  a  cut-off  criterion  is  the  application  desired,  the  system  will  produce  the  ranked  list  in  response  to  a  desired 
recall level, by concatenating the documents above the appropriate cut-off for that level of recall from Group 1; then
documents above the appropriate cut-off for that level of recall from Group 3. However, since our test results show
that there is a potential 8% error in the predicted cut-off criterion for 100% recall, we use extrapolation to add the
appropriate proportion of the top ranked documents from Group 2 to Group 1, before concatenating documents from
Group 3. These same values are used to produce the best end-to-end ranking of all the documents using the various
segments. 
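
The segmentation and concatenation can be pictured with the following Python sketch. It is only a schematic reading of the description: the tuple layout, the ordering within each group (k-value first, then SFC value), the use of `>=` for "above the cut-off", and the use of 8% as the share of Group 2 that is added are all assumptions, and the regression-based computation of the cut-off itself is not modelled.

```python
import math

def split_into_groups(docs, cutoff):
    """Partition (doc_id, k_value_or_None, sfc_value) tuples into the four groups."""
    groups = {1: [], 2: [], 3: [], 4: []}
    for doc in docs:
        _, k_value, sfc = doc
        above = sfc >= cutoff
        if k_value is not None:
            groups[1 if above else 2].append(doc)
        else:
            groups[3 if above else 4].append(doc)
    # Assumed ordering inside a group: higher k-value first, then higher SFC value.
    for members in groups.values():
        members.sort(key=lambda d: (d[1] or 0.0, d[2]), reverse=True)
    return groups

def filtered_list(docs, cutoff, group2_share=0.08):
    """Group 1, then the top share of Group 2, then Group 3 (share value is hypothetical)."""
    groups = split_into_groups(docs, cutoff)
    n_extra = math.ceil(group2_share * len(groups[2]))
    picked = groups[1] + groups[2][:n_extra] + groups[3]
    return [doc_id for doc_id, _, _ in picked]

# Hypothetical documents: (id, k-value or None, SFC similarity value).
docs = [
    ("d1", 3.0, 0.9), ("d2", 0.75, 0.2), ("d3", None, 0.8),
    ("d4", None, 0.1), ("d5", 1.5, 0.7), ("d6", 2.0, 0.3),
]
print(filtered_list(docs, cutoff=0.5))
```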

Document ranks are produced by the Integrated Matcher and the cut-off criterion is used either by an individual user
who requires a certain recall level for a particular information need, or, as in the twenty-four month TIPSTER test
situation, by the system to determine how many documents from the Integrated Matcher ranking will be passed on to
the final modules for further processing.

2.H. Topic Statement Processing for Conceptual Graph Generation


The  processing  of  topic  statements  for  CG  generation  does  not  make  use  of  the  output  of  the  Natural  Language 
Query  Constructor,  but  instead  the  current  system  first  applies  the  same  RCD  and  CG  generator  modules  to  produce 
topic statement (TS) CGs. Several TS-specific processing requirements have been identified, some of which have
been  implemented  as  post-processing  routines  and  others  are  under  development. 

-  Elimination  of  concept  and  relation  nodes  corresponding  to  contentless  meta-phrases  (e.g.  "Relevant 
document  must  identify  ...").  If  both  of  the  concept  nodes  in  a  concept-relation-concept  triple  belong 
to  a  meta-phrase,  the  CRC  is  ignored.  When  only  one  of  them  is  a  meta-phrase  concept,  the  triple  is 
not removed blindly unless the other concept occurs in another triple.

- Handling of negated parts of topic statements. The weights are adjusted in such a way that an occurrence of the
negated concept in a document will contribute negative evidence for the document's relevance. In
effect, the two weights for the concept are switched.

-  Automatic  assignment  of  weights  to  concept  and  relation  nodes.  There  are  several  factors  we  consider:  the 
conventional  way  of  determining  the  importance  of  terms  using  inverse  document  frequency  (IDF)  and  total 
frequency;  the  location  of  terms  occurring  in  topic  statements;  the  part  of  speech  information  for  each  term;  and 
indications in the topic statement sublanguage (e.g. the document MUST contain ...). Although we have
implemented  a  program  that  tags  individual  words  with  the  degree  of  importance  based  on  the  sublanguage 
patterns,  we  assigned  concept  weights  based  on  IDF  values  of  terms  in  the  collection  for  the  evaluation,  due  to 
time  constraints. 

- Merging common concepts appearing in different sections of topic statements. Although it is not safe
in general to assume that two concepts sharing the same concept name actually refer to the same concept
instantiation and merge them blindly, we have observed that this risk does not arise in the topic statements. In fact,
we believe that it is desirable to merge CG fragments using common concept nodes. This is an important process
that eliminates undesirable effects on scoring. Without this, a document containing a concept occurring repeatedly
in <desc>, <narr>, and <con> fields would be ranked unnecessarily high (or low if it is negated) because each
occurrence of the concept would make an independent contribution to the overall score.

Since  an  integrated  automatic  topic  processing  module  was  not  available,  the  mechanical  aspects  of  the  process  were 
hand-simulated with some parts done automatically and others done manually.

2.I. Relation Concept Detector (RCD)

The output of the Complex Nominal Phraser and the Proper Noun Interpreter modules described above provides
concept-relation-concept  triples  directly  to  the  Relation-Concept  Detector  (RCD)  module.  In  addition,  the  following 
RCD  handlers  are  operative. 

One  of  the  more  distinct  aspects  of  the  DR-LINK  system  is  its  capability  of  extracting  and  using  relations  in  the 
final representation of documents and topic statements in their CG representations. This module provides building
blocks  for  the  CG  representation  by  generating  concept-relation-concept  triples  based  on  the  domain-independent 
knowledge  bases  we  have  been  constructing  with  machine-readable  resources  and  corpus  statistics.  In  this  module, 
there  are  several  handlers  that  are  activated  selectively  depending  on  the  input  sentence. 

2.I.1. Case Frame (CF) Handler

The  main  function  of  the  CF  Handler  is  to  generate  concept-relation-concept  triples  where  one  of  the  concepts  comes 
typically  from  a  verb.  It  identifies  a  verb  in  a  sentence  and  connects  it  to  other  constituents  surrounding  the  verb. 
Since the relations included in our representation (we currently use about 50) originate from the theories of
linguistic case roles (Somers, 1987, and Cook, 1989) and are all semantic in nature, this module consults the
knowledge base containing 13,786 case frames we have constructed, each of which prescribes a pattern involving a
verb  and  the  corresponding  concept-relation-concept  triples. 

Given  a  set  of  case  frames  for  different  senses  of  "decline",  for  example, 

(decline 1 ((PATIENT subject ? obligatory)))

(decline 2 ((AGENT subject human obligatory)
            (PATIENT object ? optional)))

(decline 3 ((AGENT subject human obligatory)
            (ACTIVITY infinitive ? obligatory)
            (link infinitive subject AGENT)))

AGENT,  PATIENT,  and  ACTIVITY  are  the  relations  that  connect  the  verb  to  other  constituents.  The  second 
components  (e.g.  subject)  prescribe  the  syntactic  categories  of  the  constituents  and  the  third  components  (e.g. 
human)  semantic  restrictions  that  the  subject  and  object  should  satisfy.  The  last  components  (e.g.  obligatory) 
indicate whether the constituent must exist in a sentence in order for the particular case frame to be instantiated. The
last line of the third case frame instructs the CF handler to link the subject to the infinitive verb with the AGENT
relation. This kind of linking instruction allows the CF handler to produce triples containing non-verbal
constituents. 

The CF Handler selects the best case frame by attempting to instantiate each case frame and determining which one is
satisfied most by the sentence at hand. This can be seen as a sense disambiguation process using both syntactic and
semantic information. The semantic restriction information contained in the case frames was obtained from
LDOCE, and when the sentence is processed, the CF handler also consults LDOCE to get semantic restriction
information for individual constituents surrounding the verb in the sentence and compares it with the restrictions in
the  case  frames  of  the  verb  as  a  way  to  determine  which  case  frame  is  likely  to  be  the  correct  one. 

With  the  following  sentence  fragment, 

...  the  chairman  declined  to  elaborate  on  the  disclosure  ... 

the  CF  handler  chooses  the  third  case  frame  and  produces 

[decline]  ->  (AGENT)  ->  [chairman] 
[decline] -> (ACTIVITY) -> [elaborate]
[elaborate]  ->  (AGENT)  ->  [chairman] 
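
The selection step can be illustrated with a small Python sketch built on the three "decline" frames shown earlier. The data layout, the pre-parsed `constituents` dictionary, and the rule that the frame yielding the most triples counts as "satisfied most" are assumptions for the example; the actual handler draws its semantic classes from LDOCE.

```python
# Case frames for "decline", transcribed from the text.
# Slot: (relation, syntactic role, semantic restriction or None, obligatory?).
# Link: (source role, target role, relation).
CASE_FRAMES = [
    {"slots": [("PATIENT", "subject", None, True)], "links": []},
    {"slots": [("AGENT", "subject", "human", True),
               ("PATIENT", "object", None, False)], "links": []},
    {"slots": [("AGENT", "subject", "human", True),
               ("ACTIVITY", "infinitive", None, True)],
     "links": [("infinitive", "subject", "AGENT")]},
]

def instantiate(verb, frame, constituents):
    """Return the frame's triples, or None if an obligatory slot or restriction fails."""
    triples = []
    for relation, role, restriction, obligatory in frame["slots"]:
        filler = constituents.get(role)
        if filler is None:
            if obligatory:
                return None
            continue
        word, semantic_class = filler
        if restriction is not None and semantic_class != restriction:
            return None
        triples.append((verb, relation, word))
    for source_role, target_role, relation in frame["links"]:
        if source_role in constituents and target_role in constituents:
            triples.append((constituents[source_role][0], relation,
                            constituents[target_role][0]))
    return triples

def best_case_frame(verb, frames, constituents):
    """Pick the instantiation with the most triples (assumed measure of 'satisfied most')."""
    candidates = [instantiate(verb, frame, constituents) for frame in frames]
    candidates = [triples for triples in candidates if triples is not None]
    return max(candidates, key=len, default=[])

# "... the chairman declined to elaborate on the disclosure ..."
constituents = {"subject": ("chairman", "human"), "infinitive": ("elaborate", None)}
triples = best_case_frame("decline", CASE_FRAMES, constituents)
for source, relation, target in triples:
    print(f"[{source}] -> ({relation}) -> [{target}]")
```

With these inputs the third frame wins and the sketch prints the same three triples as the example above.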

In the current implementation, the input text to the CF handler is first tagged with part-of-speech information and
bracketed  for  constituent  boundaries.  BBN's  POST  tagger  (Metter  et  al.,  1991)  has  been  used  to  attach  a 
part-of-speech  tag  to  individual  words.  The  constituent  boundary  bracketer  we  developed  then  marks  boundaries  of 
grammatical  constituents  such  as  infinitives,  noun  phrases,  prepositional  phrases,  clauses,  etc. 

At the time of writing, the case frame knowledge base contains 13,786 case frames, of which 13,444 are for all the
verb entries (5,206) in LDOCE, and the rest are for 342 verbs that appear in the Wall Street Journal collection but
are not in LDOCE as headwords. While we have constructed case frames for most of the phrasal verbs in
LDOCE, the capability of processing phrasal verbs has not been implemented in the current CF Handler.

2.I.2. Nominalized Verb (NV) Handler

The nominalized verb handler has been implemented for the DR-LINK system we ran for the TIPSTER 24th month
evaluation. Its main function is to consult the NV case frames to identify a NV in a sentence and create
concept-relation-concept triples based on the rule. At the same time, it converts the NV into its verb form. In this
way, we can allow for a match between a CG fragment generated from a phrase containing a verb and another
fragment generated from a noun phrase containing the corresponding nominalized verb. For example, the NV Handler
converts the sentence fragment

...  the  company's  investigation  of  the  incident ... 

into 

[investigate]  ->  (AGENT)  ->  [company] 
[investigate] -> (PATIENT) -> [incident].

This  process  is  much  more  than  a  sophisticated  way  of  performing  stemming  in  that  we  canonicalize 
concept-relation-concept triples rather than just concept nodes.
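
A minimal Python sketch of this canonicalization is given below. The two-entry lexicon and the single surface pattern ("X's NV of Y") are hypothetical stand-ins for the NV case frames, which are not reproduced in the text.

```python
import re

# Hypothetical nominalized-verb lexicon (the real handler uses NV case frames).
NV_TO_VERB = {"investigation": "investigate", "reduction": "reduce"}

# Assumed surface pattern: "<possessor>'s <NV> of [the|a|an] <object>".
NV_PATTERN = re.compile(r"(\w+)'s (\w+) of (?:the |a |an )?(\w+)")

def nv_triples(fragment):
    """Rewrite a possessive NV phrase as verb-centred triples, or return []."""
    match = NV_PATTERN.search(fragment)
    if not match:
        return []
    possessor, noun, obj = match.groups()
    verb = NV_TO_VERB.get(noun)
    if verb is None:
        return []
    return [(verb, "AGENT", possessor), (verb, "PATIENT", obj)]

print(nv_triples("the company's investigation of the incident"))
```

Because both the verb phrase and the noun phrase end up as triples headed by [investigate], the two forms can match each other at the relation level rather than only at the stem level.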

For NV processing, 15,053 case frames have been generated for 1,593 nominalized verbs. Most of the case frames
for NVs were automatically generated from the corresponding verb case frames. This process was also facilitated by
identifying  potential  NVs  from  LDOCE. 

No explicit testing of the impact of NVs in information retrieval has been done yet, although anecdotal evidence
has convinced us that this would improve the retrieval performance. More semantic processing of
nominalized verbs in determining the relations to the surrounding constituents is on the future research agenda. More
rigorous  study  on  the  impact  of  NVs  on  information  retrieval  should  be  done,  too. 

2.I.3. Noun Phrase (NP) and Prepositional Phrase (PP) Handler

The  noun  phrases  that  are  not  handled  by  the  complex-nominal  handler  or  by  the  nominalized  verb  handler  are 
analyzed  so  that  the  head  noun  is  connected  to  the  concepts  outside  the  noun  phrase  (e.g.  a  verb  concept  in  the  CF 
Handler). In addition, this module identifies individual concepts corresponding to adjectives and other nouns in a
compound  noun  and  connects  them  with  CHARACTERISTIC,  ATTRIBUTE,  or  LINK  relations.  LINK  is  the  most 
generic  relation  in  our  system. 

Once  noun  phrases  are  handled  this  way,  this  module  handles  prepositional  phrases  by  connecting  the  head  noun 
concept of the noun phrase to the preceding constituent (e.g. a verb or a noun). The preposition attachment problem
is a difficult one, and the current implementation takes the simple-minded approach with general relations such as
LINK, which can match with many of the other semantically more specific relations. Our preliminary analysis indicates
that this approach correctly handles about 75% of the prepositional phrase cases in the Wall Street Journal
collection. More accurate and finer-level processing will be done with more semantically oriented rules that check
the  semantic  restrictions  and  use  more  specific  relations.  The  role  of  this  handler  will  be  diminished  when  we  process 
phrasal  verbs  as  part  of  the  CF  handler,  for  which  we  have  constructed  case  frames. 

2.I.4. Ad-hoc Handler

This  module  looks  for  lexical  patterns  not  covered  by  any  of  the  other  special  handlers  discussed  above.  Its 
processing  is  also  driven  by  its  own  knowledge  base  of  patterns  to  infer  relations  between  concepts.  For  example,  a 
sentence  fragment 

... bought the item for the purpose of satisfying ...

contains a pattern

[VERB] ... for the (ADJ) purpose of [NP]

in the knowledge base, and hence results in a triple

[buy] -> (GOAL) -> [satisfy]


The  knowledge  base  contains  a  small  number  of  simple  patterns  involving  BE  verbs  and  more  than  350  pattern  rules 
for  phrasal  patterns  across  phrase  boundaries,  by  which  important  relations  are  extracted.  The  pattern  rules  specify 
certain  lexical  patterns  and  the  order  of  occurrences  of  words  belonging  to  certain  part-of-speech  categories,  and  the 
concept-relation-concept triples to be generated. These patterns require a processing capability no more powerful than
a finite state automaton. Due to the time constraints, however, the current ad-hoc handler has not been generalized to
process all the patterns, and about 30% of the patterns in the knowledge base are recognized and handled correctly.
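
Since the pattern rules need no more than finite-state power, one of them can be approximated with a regular expression over part-of-speech tagged text. The Python sketch below encodes the "for the (ADJ) purpose of" rule from the example; the `word/TAG` input format, the Penn Treebank style tags, and the tiny lemma table are assumptions, not the handler's actual rule syntax.

```python
import re

# Hypothetical lemma table standing in for morphological normalization.
LEMMAS = {"bought": "buy", "satisfying": "satisfy"}

# [VERB] ... for the (ADJ) purpose of <next word>, over "word/TAG" tokens.
PURPOSE_PATTERN = re.compile(
    r"(\w+)/VB\w*\b.*?\bfor/IN the/DT (?:\w+/JJ )?purpose/NN of/IN (\w+)/\w+"
)

def purpose_triples(tagged_text):
    """Return a GOAL triple if the purpose pattern occurs in the tagged text."""
    match = PURPOSE_PATTERN.search(tagged_text)
    if not match:
        return []
    verb, goal = (LEMMAS.get(w.lower(), w.lower()) for w in match.groups())
    return [(verb, "GOAL", goal)]

tagged = "bought/VBD the/DT item/NN for/IN the/DT purpose/NN of/IN satisfying/VBG"
print(purpose_triples(tagged))
```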

2.J. Conceptual Graph (CG) Generator

After individual RCD modules have generated concept-relation-concept triples for a document, the CG generator
merges them to form a set of conceptual graphs, each corresponding to a clause in most cases. Since more than one
handler can generate different triples for the same concept pairs (e.g. a prepositional phrase handled by the CF handler
and the NP/PP handler) based on independently constructed rules and on independent processes, a form of conflict
resolution is necessary. In the current implementation, we simply order the execution of different handlers based on
the general quality of the rules and the resulting triples so that more reliable handlers have higher precedence.
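
Precedence-based conflict resolution of this kind might look like the following Python sketch. The particular precedence order and the decision to treat a concept pair as unordered are assumptions; the text states only that more reliable handlers take precedence.

```python
# Assumed precedence, most reliable handler first (the actual order is not given).
HANDLER_PRECEDENCE = ["CF", "NV", "NP/PP", "ad-hoc"]

def resolve_conflicts(handler_output):
    """Keep one triple per concept pair, taken from the highest-precedence handler."""
    chosen = {}
    for handler in HANDLER_PRECEDENCE:
        for source, relation, target in handler_output.get(handler, []):
            pair = frozenset((source, target))
            if pair not in chosen:
                chosen[pair] = (source, relation, target)
    return list(chosen.values())

# Hypothetical output of two handlers for the same clause.
handler_output = {
    "NP/PP": [("elaborate", "LINK", "disclosure"), ("decline", "LINK", "chairman")],
    "CF": [("decline", "AGENT", "chairman")],
}
print(resolve_conflicts(handler_output))
```

Here the generic LINK between [decline] and [chairman] is dropped in favour of the CF handler's AGENT relation, while the NP/PP triple for a pair that no other handler covered is kept.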

The concept nodes in the resulting CGs can contain not only general concept names but also some instantiations
(referents) of the concepts. Such a concept can be derived either from a proper noun such as a company name or from
a subordinate clause. In the latter case, the instantiation is a CG itself to produce a CG like

[country:  {US}]  <-  (SOURCE-OF-INFO)  <-  [C#:  [[pact]  <-  (PATIENT)  <- ...  ] 

In  the  current  implementation,  concepts  with  the  same  instantiation  are  merged  across  sentences  to  form  a  larger  CG, 
but concepts with the same label but without any referents across sentences are treated as separate concepts and are not
merged. A pronoun resolution method is being implemented to merge a pronoun to its antecedent as a way to
increase  the  connectivity  of  CGs  and  hence  increase  the  usefulness  of  relation  nodes. 

As a way to make our current representation more "conceptual", we have implemented a module that adds RIT
(Roget's International Thesaurus) codes to individual concept nodes so that the label on the nodes is not a word but a
position in the hierarchy of RIT. The lowest level position beyond individual lexical items in the RIT hierarchy is
called a semi-colon group, consisting of several terms delimited by semi-colons, which represents a
concept. 

The mapping from a word (called target) in text to a position in RIT requires sense disambiguation, and our approach
is to use the words surrounding the target word as the context within which the sense of the target word is determined
and one or more RIT codes are selected. The algorithm selects a minimal number (i.e. one or more) of RIT codes, not
just the best one, for target words since we feel that some of the sense distinctions made in RIT are unnecessarily
subtle, and it is unlikely that any attempts to make such fine distinctions would be successful and hence contribute
to  information  retrieval. 

We have produced RIT-coded documents and topic statements for the San Jose Mercury collection and the routing
queries. All the concept nodes derived from nouns now have RIT codes selected using the surrounding text as the
context. Those concept nodes derived from verbs also have RIT codes, but assigned in a different way. Instead of using the
surrounding text as the context and trying to disambiguate senses (we concluded that this method is not reliable for
verbs), we first assign RIT codes to each sense of LDOCE verb entries using the same method. In this case the
context becomes the definition text in LDOCE. Once the Case Frame Handler selects the right case frame while text
is processed, the RIT codes attached to the case frame are automatically assigned to the target verb.

2.K.  Conceptual  Graph  (CG)  Matcher 

The  main  function  of  the  CG  matcher  is  to  determine  the  relevance  of  each  document  against  a  topic  statement  CG 
and produce a ranked list of documents as the third and final output of the system. Using the techniques necessary to
model plausible inferences with CGs (Myaeng and Khoo, 1992), this module computes the degree to which the topic
statement CG is covered by the CGs in the document (see Myaeng and Liddy (1993) and Myaeng & Lopez-Lopez
(1992)  for  details). 

While  the  most  obvious  strength  of  the  CG  approach  is  its  ability  to  enhance  precision  by  exploiting  the  structure 
of  the  CGs  and  the  semantics  of  relations  in  document  and  topic  statement  CGs,  and  by  attempting  to  meet  the 
specific  semantic  constraints  of  topic  statements,  we  also  attempt  to  increase  recall  by  allowing  flexibility  in 
node-level matching. Concept labels can be matched partially (e.g. between 'Bill Clinton' and 'Clinton'), and both
relation and concept labels can be matched inexactly (e.g. between 'aid' and 'loan' or between 'AGENT' and
'EXPERIENCER'). For both inexact and partial matches, we determine the degree of matching and apply a
multiplication factor less than 1 to the resulting score. For inexact matching cases, we have used a relation
similarity table that determines the degree of similarity between pairs of relations. Although this type of matching
slows down the matching time, we feel that until we have a more accurate way of determining the conceptual
relations and a way to represent at a truly conceptual level (e.g. our attempt to use RIT codes), it is necessary. More
importantly, the similarity table reflects our ontology of relations and allows for matching between relations
produced  by  different  RCD  handlers  whose  operations  in  turn  are  heavily  dependent  on  the  domain-independent 
knowledge  bases. 
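
The node-level scoring described above can be sketched as follows. This is a minimal illustration and not the DR-LINK implementation: the table entry, the 0.8 value, the token-overlap rule for partial matches, and all function names are assumptions introduced here, since the paper does not list them.

```python
# Hypothetical relation similarity table. The paper does not publish its
# entries or values, so the number below is a placeholder for illustration.
RELATION_SIMILARITY = {
    frozenset(["AGENT", "EXPERIENCER"]): 0.8,
}


def relation_factor(query_relation, doc_relation):
    """Return 1.0 for an exact match, a table value below 1 for an inexact
    match, and 0.0 when the table has no entry for the pair."""
    if query_relation == doc_relation:
        return 1.0
    return RELATION_SIMILARITY.get(frozenset([query_relation, doc_relation]), 0.0)


def concept_factor(query_label, doc_label):
    """Assumed partial-match rule: the fraction of shared tokens, so that
    'Bill Clinton' against 'Clinton' scores below an exact match."""
    query_tokens = set(query_label.lower().split())
    doc_tokens = set(doc_label.lower().split())
    if not query_tokens or not doc_tokens:
        return 0.0
    shared = len(query_tokens & doc_tokens)
    return shared / max(len(query_tokens), len(doc_tokens))


def node_match_score(base_score, query_relation, doc_relation, query_label, doc_label):
    """Scale the base score by both multiplication factors (each at most 1)."""
    return (base_score
            * relation_factor(query_relation, doc_relation)
            * concept_factor(query_label, doc_label))
```

Inexact concept matches such as 'aid' against 'loan' would need a concept-level resource in place of the token-overlap rule; that part is left out of the sketch.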

We have done a series of matching experiments internally to evaluate various strategies in CG matching/scoring and
document representation with the goal of selecting the best one for the final TIPSTER 24th month runs. The first
question we had was how to "normalize" the score assigned to a document based on the current scoring scheme. As
described above, the scoring algorithm is query-oriented in the sense that the score reflects to what extent the query
CG is covered by the document CG. While this approach is theoretically justifiable, one potential drawback is that a
document containing the entire query CG is not ranked higher than one that contains fragments of the query CG
scattered in the document as long as they cover the same query CG. That is, the "connectivity" or "coherence" of the
matching document CG is not fully taken into account.

With the intuitive notion that the number of matching CG fragments in a document would be inversely proportional
to "connectivity", we have been experimenting with various normalization factors that are a function of the number
of matching CG fragments. At the time of writing, our experimental data show that when we consider 12 sentential
CGs as a unit (called a "paragraph") and use the number of units containing one or more matching CG fragments in
the normalization function, we obtain the best result. Among all the functions we have tried, the best normalization
factor we have found experimentally so far is:

$$1.05^{(1-x)}$$

where x is the number of text units that contain one or more matching CG fragments. When this is combined with
the maximum of the scores assigned to individual "paragraphs" as follows:

$$S \times 1.05^{(1-x)} + 0.4 \times M$$

where S is the unnormalized score and M is the maximum "paragraph" score, we obtained the best results. Since
we determined the constants incrementally, it is entirely possible that a different combination of the constants can give
better results. It is relatively clear from these experiments that the first or the second term alone is always
inferior to the combination. The number of sentential CGs per "paragraph", 12, also seems fairly stable.
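
The combined score can be written out as a short function. This is a sketch under stated assumptions: the function and parameter names are ours, and rejecting documents with no matching unit is our choice, since the paper does not say how that case is handled. The constants 1.05 and 0.4 are the ones reported above.

```python
def normalized_cg_score(raw_score, matching_units, max_unit_score,
                        base=1.05, max_weight=0.4):
    """Return S * base**(1 - x) + max_weight * M.

    raw_score      -- S, the unnormalized query-coverage score
    matching_units -- x, the number of 12-CG "paragraphs" with a matching fragment
    max_unit_score -- M, the best score of any single "paragraph"
    """
    if matching_units < 1:
        # Assumption: a document with no matching unit is not scored at all.
        raise ValueError("matching_units must be at least 1")
    connectivity_factor = base ** (1 - matching_units)
    return raw_score * connectivity_factor + max_weight * max_unit_score


# The same raw score is worth less when it is spread over more units.
concentrated = normalized_cg_score(10.0, 1, 4.0)
scattered = normalized_cg_score(10.0, 3, 4.0)
```

With x = 1 the factor is exactly 1, so a document whose matches fall in a single unit keeps its full raw score; each additional unit divides the first term by 1.05.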

We have produced TIPSTER runs using the RTT-coded documents and topic statements. The current matching
program attempts to match on RTT codes only when the concept names (words) don't match. Because of this
conservative approach, the RTT codes do not block a match between two different polysemous words and thus do not
have any direct impact on the word ambiguity problems in IR. With the disambiguation process employed when RTT
codes are chosen for a noun or verb, however, the net effect is analogous to term expansion with sense
disambiguation. It should be noted that since RTT codes are used for both document and query concepts, this amounts
to sense-disambiguated term expansion on both queries and documents.

While the original motivation was to represent documents and topic statements at a more conceptual level using RTT
codes, we are also testing the effectiveness of RTT-based term expansion in IR environments. Using the scheme we
have developed for term clustering using contextual information in the corpus (Myaeng & Li, 1992), we have three
methods to evaluate: RTT-based expansion, term-cluster based expansion, and a combination of the two so that we
can eliminate the problem of using a general-purpose thesaurus and the errors made by the term-clustering method.

For TIPSTER evaluation, we have submitted two sets of four runs: one with RTT codes and the other without them.
Each set consists of three runs for different scoring schemes and a fourth run that combines the three,
which appears to produce the best result in our internal experiments.

3.  Test  Runs 

The DR-LINK group elected to put their efforts into continued work for the twenty-four month TIPSTER testing,
and  as  a  result  we  lost  our  opportunity  to  have  TREC-compatible  results  to  discuss  at  this  time.  Although  our 
twenty-four  month  TIPSTER  runs  have  been  submitted,  many  of  our  top-ranked  documents  were  not  amongst  those 
submitted  by  TREC  participants,  so  it  is  virtually  impossible  to  make  even  unofficial  reports  on  our  system's 
performance.  We  trust  that  in  the  near  future  there  will  be  some  comparable  groups  and/or  runs  to  measure  ourselves 
against  after  the  results  from  both  TIPSTER  and  TREC-2  are  available. 

As  the  above  descriptions  should  convey,  we  have  made  a  great  deal  of  progress  in  the  development  and  integration 
of  the  DR-LINK  System  since  TREC-1.  Unfortunately,  the  absence  of  quantified  results  of  our  performance  limits 
our convincing power. However, we are pleased to have demonstrated that a system implementation of our original
notion, integrating multiple levels of linguistic processing so that retrieval can be conducted at a conceptual rather
than word-based level, is nearly achieved.

Many  rich  research  and  implementation  ideas  remain  to  be  explored  in  all  of  the  DR-LINK  modules,  particularly 
those  which  have  only  been  in  existence  for  a  few  months. 
Abstract

We briefly review the MatchPlus system and describe
recent developments with learning word representations,
experiments with relevance feedback using neural
network learning algorithms, and methods for
combining different output lists.

1  Introduction 

HNC is developing a neural network related approach
to document retrieval called MatchPlus^. Goals of
this approach include high precision/recall performance,
ease of use, incorporation of machine learning
algorithms, and sensitivity to similarity of use.

To  understand  our  notion  of  sensitivity  to  similar- 
ity of  use,  consider  the  four  words:  'car',  'automobile', 
'driving',  and  'hippopotamus'.  'Car'  and  'automo- 
bile' are  synonyms  and  they  very  often  occur  together 
in  documents;  'car'  and  'driving'  are  related  words 
(but  not  synonyms)  that  sometimes  occur  together 
in  documents;  and  'car'  and  'hippopotamus'  are  es- 
sentially unrelated  words  that  seldom  occur  within 
the same document. We want the system to be sensitive
to such similarity of use, much like a built-in
thesaurus,  yet  without  the  drawbacks  of  a  thesaurus, 
such  as  domain  dependence  or  the  need  for  hand- 
entry  of  synonyms.  In  particular  we  want  a  query  on 
'car'  to  prefer  a  document  containing  'drive'  to  one 
containing  'hippopotamus',  and  we  want  the  system 
itself  to  be  able  to  figure  this  out  from  the  corpus. 

The implementation of MatchPlus is motivated by
neural networks, and designed to interface with neural
network learning algorithms. High-dimensional
(≈ 300) vectors, called context vectors, represent
word stems, documents, and queries in the same vector
space. This representation permits one type of
neural network learning algorithm to generate stem
context vectors that are sensitive to similarity of use,
and a more standard neural network algorithm to perform
routing and automatic query modification based
upon user feedback, as described below.

•124 Mt Auburn St, Suite 200, Cambridge, MA 02138.
'5501 Oberlin Drive, San Diego, CA 92121.
^Patents pending.

Queries  can  take  the  form  of  terms,  full  documents, 
parts  of  documents,  and/or  conventional  Boolean  ex- 
pressions. Optional  weights  may  also  be  included. 

The  following  sections  give  a  brief  overview  of  our 
implementation,  and  look  at  some  recent  improve- 
ments and  experiments.  For  a  previous  description  of 
the  approach  and  comments  on  complexity  considera- 
tions see  [1];  a  longer  journal  article  is  in  preparation. 

2    The Context Vector Approach

One of the most important aspects of MatchPlus is
its representation of words (stems), documents, and
queries by high-dimensional (≈ 300) vectors called
context vectors. By representing all objects in the
same high-dimensional space we can easily:

1.  Form  a  document  context  vector  as  the 
(weighted)  vector  sum  of  the  context  vectors  for 
those  words  (stems)  contained  in  the  document. 

2.  Form  a  query  context  vector  as  the  (weighted) 
vector  sum  of  the  context  vectors  for  those  words 
(stems)  contained  in  the  query. 

3.  Compute the distance of a query Q to any document.
Moreover, if document context vectors
are normalized, the closest document d (in Euclidean
distance) has the context vector $V^d$ that
gives the highest dot product with the query context
vector $V^Q$:

$$\langle \text{closest } d \rangle = \{\, d \mid V^d \cdot V^Q \text{ is maximized for } d \in D \,\}$$

(proof: $\|V^d - V^Q\|^2 = \|V^d\|^2 + \|V^Q\|^2 - 2(V^d \cdot V^Q) = \text{const} - 2(V^d \cdot V^Q)$.)

4.  Find the closest document to a given document
d by treating $V^d$ as a query vector.

5.  Perform relevance feedback. If d is a relevant
document for query Q, form a new query vector

$$V^{Q'} = V^Q + \alpha V^d$$

where $\alpha$ is some suitable positive number (e.g., 3).

(See also [8].) Note that search with $V^{Q'}$ takes
the same amount of time as search with $V^Q$.
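
The first four operations in the list can be illustrated with a small sketch. The three-dimensional stem vectors below are made-up toy values (real context vectors have about 300 learned components), and the helper names are ours. The sketch checks that, for normalized document vectors, ranking by largest dot product and ranking by smallest Euclidean distance select the same document.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    length = math.sqrt(dot(v, v))
    return [a / length for a in v] if length > 0 else list(v)


def weighted_sum(vectors, weights):
    """Weighted vector sum, as used for both document and query vectors."""
    total = [0.0] * len(vectors[0])
    for vec, weight in zip(vectors, weights):
        for i, component in enumerate(vec):
            total[i] += weight * component
    return total


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Toy stem context vectors (illustrative values only).
stems = {
    "car": [1.0, 0.2, 0.0],
    "drive": [0.8, 0.5, 0.1],
    "hippopotamus": [0.0, 0.1, 1.0],
}

# Document vectors: normalized weighted sums of their stem vectors.
docs = {
    "d1": normalize(weighted_sum([stems["car"], stems["drive"]], [1.0, 1.0])),
    "d2": normalize(weighted_sum([stems["hippopotamus"]], [1.0])),
}

query = normalize(weighted_sum([stems["car"]], [1.0]))

closest_by_dot = max(docs, key=lambda d: dot(docs[d], query))
closest_by_distance = min(docs, key=lambda d: euclidean(docs[d], query))
```

Both selections agree because the squared distance differs from the dot product only by a constant and a factor of -2 once the document vectors have unit length.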

2.1  Context  Vector  Representations 

Context  vector  representations  (or  feature  space  rep- 
resentations) have  a  long  history  in  cognitive  science. 
Work by Waltz & Pollack [10] had an especially strong
influence  on  the  work  reported  here.  They  described  a 
neural  network  model  for  word  sense  disambiguation 
and  developed  context  vector  representations  (which 
they  termed  micro-feature  representations).  See  Gal- 
lant [2]  for  more  background  on  context  vector  repre- 
sentations and  word  sense  disambiguation. 

We  use  context  vector  representations  for  docu- 
ment retrieval,  with  all  of  the  representation  being 
learned  from  an  unlabeled  corpus.  A  main  constraint 
for  all  of  this  work  is  to  keep  computation  and  storage 
reasonable,  even  for  very  large  corpora. 

2.2  Bootstrap  Learning 

Bootstrapping  is  a  machine  learning  technique  that 
begins  with  vectors  having  randomly  generated  pos- 
itive and  negative  components,  and  then  uses  an 
unlabeled  training  corpus  to  modify  the  vectors  so 
that  similarly  used  terms  have  similar  representa- 
tions. Previously  we  had  used  partially  hand-entered 
components  as  described  in  [2],  but  we  have  dispensed 
with all hand entry in current implementations.

Although  there  are  important  proprietary  details, 
the  basic  idea  for  bootstrapping  is  to  make  a  stem's 
vector  more  like  its  neighbors  by  adding  a  fraction 
of  their  vectors  to  the  stem  in  question.  We  make 
use  of  a  key  property  of  high-dimensional  vectors: 
the  ability  to  be  'similar  to'  a  multitude  of  vectors. 
This  is  the  same  property  that  allows  the  vector  sum 
that  represents  a  document  to  be  similar  to  individual 
term vector summands. (Similarity between normalized
vectors is measured by their inner product.)

Note  that  bootstrapping  takes  into  account  local 
word  positioning  when  assigning  the  context  vector 
representation  for  stems.  Moreover  it  is  nearly  in- 
variant with  respect  to  document  divisions  within  the 
training corpus. This contrasts with those methods
where stem representations are determined solely by
those documents in which the stem lies.
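
Since the actual procedure is proprietary, the following is only a guess at the general shape of one bootstrapping pass, written to illustrate the stated idea of adding a fraction of the neighbors' vectors to each stem. The window size, the learning rate, the synchronous update, and the renormalization step are all assumptions, and the starting vectors are fixed toy values, not random ones.

```python
import math


def normalize(v):
    length = math.sqrt(sum(a * a for a in v))
    return [a / length for a in v] if length > 0 else list(v)


def bootstrap_pass(corpus_stems, vectors, window=1, rate=0.5):
    """Move each stem vector toward the vectors of stems that occur within
    `window` positions of it, then renormalize. Hypothetical sketch only."""
    dim = len(next(iter(vectors.values())))
    updates = {stem: [0.0] * dim for stem in vectors}
    for pos, stem in enumerate(corpus_stems):
        start = max(0, pos - window)
        stop = min(len(corpus_stems), pos + window + 1)
        for j in range(start, stop):
            if j == pos:
                continue
            neighbor = vectors[corpus_stems[j]]  # old vectors: synchronous update
            for i in range(dim):
                updates[stem][i] += rate * neighbor[i]
    return {
        stem: normalize([a + b for a, b in zip(vec, updates[stem])])
        for stem, vec in vectors.items()
    }


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


start_vectors = {
    "car": [1.0, 0.0, 0.0],
    "drive": [0.0, 1.0, 0.0],
    "hippopotamus": [0.0, 0.0, 1.0],
}
learned = bootstrap_pass(["car", "drive", "car", "drive"], start_vectors)
```

In this toy run 'car' and 'drive' start out orthogonal and end up similar because they are neighbors in the stream, while 'hippopotamus', which never co-occurs with either, stays orthogonal to both. The pass only looks at local positions in the token stream, which is consistent with the remark that the method is nearly invariant to document divisions.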

2.3  Context  Vectors  for  Documents 

Once  we  have  generated  context  vectors  for  stems  it  is 
easy  to  compute  the  context  vector  for  a  document. 
We  simply  take  a  weighted  sum  of  context  vectors 
for  all  stems  appearing  in  the  document^  and  then 
normalize  the  sum.  This  procedure  applies  to  docu- 
ments in  the  training  corpus  as  well  as  to  new  docu- 
ments. When  adding  up  stem  context  vectors,  we  can 
use  term  frequency  weights  similar  to  conventional  IR 
systems. 

2.4  Context Vectors for Queries; Relevance Feedback

Query  context  vectors  are  formed  similarly  to  docu- 
ment context  vectors.  For  each  stem  in  the  query  we 
can  apply  a  user-specified  weight  (default  1.0).  Then 
we  can  sum  the  corresponding  context  vectors  and 
normalize  the  result. 

Note  that  it  is  easy  to  implement  traditional  rel- 
evance feedback.  The  user  can  specify  documents 
(with  weights)  and  the  document  context  vectors  are 
merely  added  in  with  the  context  vectors  from  the 
other  terms.  We  can  also  find  documents  close  to  a 
given  document  by  using  the  document  context  vec- 
tor as  a  query  context  vector. 

2.5  Retrieval 

The  basic  retrieval  operation  is  simple;  we  find  the 
document  context  vector  closest  to  the  query  context 
vector  and  return  it.  There  are  several  important 
points  to  note. 

1.  As  many  documents  as  desired  may  be  retrieved, 
and  the  distances  from  the  query  context  vector 
give  some  measure  of  retrieval  quality. 

2.  Because document context vectors are normalized,
we may simply find the document d that
maximizes the dot product with the query context
vector, $V^Q$:

$$\max_d \{ V^d \cdot V^Q \}.$$

3.  It is easy to combine keyword match with context
vectors. We first use the match as a filter
for documents and return documents in order by
closeness to the query vectors. If all matching
documents have been retrieved, MatchPlus can
revert to context vectors for finding the closest
remaining document.

^Stopwords are discarded.

4.  MatchPlus  requires  only  about  300  multiplica- 
tions and  additions  to  search  a  document.  More- 
over it  is  easy  to  decompose  the  search  for  a  cor- 
pus of  documents  with  either  parallel  hardware 
or,  less  expensively,  several  networked  conven- 
tional machines  (or  chips).  Each  machine  can 
search  a  subset  of  the  document  context  vectors 
and  return  the  closest  distances  and  document 
numbers  in  its  subset.  The  closest  from  among 
the  distances  returned  by  all  the  processors  then 
determines  the  documents  chosen  for  retrieval. 

We  also  plan  to  investigate  a  cluster  tree  prun- 
ing procedure  that  finds  nearest  neighbor  docu- 
ment context  vectors  without  having  to  compute 
dot  products  for  all  document  context  vectors. 
This  data  organization  affects  retrieval  speed, 
but  does  not  change  the  order  in  which  docu- 
ments are  retrieved. 
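
The decomposition of the search across machines can be sketched as below. Each "machine" is simulated by a plain dictionary of document vectors; the function names and the use of the dot product as the closeness measure are our assumptions. Every shard returns its own k best documents, and the merge keeps the k best overall.

```python
import heapq


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def search_shard(shard, query, k):
    """Score one subset of the document context vectors and return its
    k best (score, doc_id) pairs. One dot product costs one multiply and
    one add per vector dimension."""
    scored = ((dot(vec, query), doc_id) for doc_id, vec in shard.items())
    return heapq.nlargest(k, scored)


def search_all(shards, query, k):
    """Merge the per-shard results and keep the k best overall."""
    partial = [hit for shard in shards for hit in search_shard(shard, query, k)]
    return [doc_id for _score, doc_id in heapq.nlargest(k, partial)]


shards = [
    {"a": [1.0, 0.0], "b": [0.0, 1.0]},
    {"c": [0.6, 0.8], "d": [0.8, 0.6]},
]
top_two = search_all(shards, [1.0, 0.0], k=2)
```

Because each shard reports its own top k, the merged list is the same as the one a single exhaustive scan would produce, so the partitioning changes speed but not the retrieval order.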

3    Experiments and Runs Submitted

The  four  runs  (two  ad-hoc,  two  routing)  submitted 
to  TREC-II  are  described  in  the  following  sections. 

3.1  Totally  Automated 

This  run  used  the  entire  topic  as  a  query,  with  weight 
of  2  applied  to  the  topic  section.  We  also  employed 
a  SMatch  filter  that  put  documents  having  at  least  4 
of  the  concept  terms  before  other  documents.  Con- 
text vectors  determined  the  actual  order  of  the  docu- 
ments (as  well  as  which  documents  were  in  the  1000- 
document  submission  list.)  Note  that  these  ad  hoc 
retrieval  runs  were  totally  automated  from  topic  to 
retrieval  list. 

3.2  Relevance Feedback From the First 20 Retrievals

Here  we  took  the  top  20  documents  from  the  previous 
section,  read  them  to  estimate  relevance,  and  then 
modified  the  initial  query  for  the  remaining  980  re- 
trievals. To  change  a  query  we  added  and  normalized 
all  relevant  document  context  vectors  from  the  first 
20  retrievals  and  then  added  this  vector  to  0.7  times 
the original query vector. (The 0.7 factor was obtained
from experiments with a different corpus.)
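
The query modification just described amounts to a few lines of vector arithmetic. In this sketch the function name is ours, and returning the original query unchanged when no document was judged relevant is an assumption; the 0.7 factor is the one given above.

```python
import math


def normalize(v):
    length = math.sqrt(sum(a * a for a in v))
    return [a / length for a in v] if length > 0 else list(v)


def feedback_query(original_query, relevant_doc_vectors, original_weight=0.7):
    """Add the normalized sum of the relevant document context vectors to
    `original_weight` times the original query vector."""
    if not relevant_doc_vectors:
        # Assumption: with no relevant documents the query is left as it is.
        return list(original_query)
    summed = [sum(components) for components in zip(*relevant_doc_vectors)]
    relevant_direction = normalize(summed)
    return [original_weight * q + r
            for q, r in zip(original_query, relevant_direction)]


new_query = feedback_query([1.0, 0.0], [[0.0, 1.0], [0.0, 1.0]])
```

The modified vector is then used to rank the remaining documents in the same way as the original query vector.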


3.3    Routing Using Neural Network Learning and Output Mixing

Both  routing  runs  used  two  types  of  neural  network 
learning. 

3.3.1  Stem  Weight  Learning 

The  Stem  Weight  Learning  approach  uses  neural  net- 
work learning  to  compute  weights  for  terms  in  a 
query;  there  is  one  neural  network  input  for  every 
query  term.^ 

Every judged document for a query provides a
training example as follows. Let $V^n$ be the context
vector for judged document n. If query term i has
context vector $V^i$, then the i-th network input for
training example n is given by $V^i \cdot V^n$, the inner
product of $V^i$ and $V^n$. For training example n, the desired
network output is ±1, according to the relevance of
document n.

A  fast  single-cell  learning  algorithm,  the  pocket  al- 
gorithm with  ratchet  [3,  4],  generated  weights.  (More 
complex  algorithms  did  not  produce  better  results.) 
These  weights  were  then  used  as  term  weights  for  nor- 
mal query  processing. 

Note  that  Wong,  Yao,  et  al  [9]  previously  used  a 
similar  approach,  applying  a  variant  of  perceptron 
learning  to  learn  weights  with  the  SMART  vector 
space  model. 
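
For readers unfamiliar with the pocket algorithm with ratchet, here is a compact sketch of the general technique as we understand it from [3, 4]; it is not the MatchPlus code. Each training example is a list of inputs (in the setting above, the inner products of the query term vectors with one judged document vector) and a target of +1 or -1. The bias input, the iteration count, and the toy data are assumptions.

```python
import random


def predict(weights, inputs):
    """Single-cell output: +1 if the weighted sum (plus bias) is positive, else -1."""
    activation = sum(w * x for w, x in zip(weights, inputs + [1.0]))
    return 1 if activation > 0 else -1


def count_correct(weights, examples):
    return sum(1 for inputs, target in examples if predict(weights, inputs) == target)


def pocket_with_ratchet(examples, iterations=1000, seed=0):
    rng = random.Random(seed)
    weights = [0.0] * (len(examples[0][0]) + 1)  # last weight is the bias
    pocket = list(weights)
    pocket_correct = count_correct(pocket, examples)
    run, pocket_run = 0, 0
    for _ in range(iterations):
        inputs, target = rng.choice(examples)
        if predict(weights, inputs) == target:
            run += 1
            if run > pocket_run:
                # Ratchet: replace the pocket only if the current weights
                # also classify more training examples correctly.
                correct = count_correct(weights, examples)
                if correct > pocket_correct:
                    pocket, pocket_run, pocket_correct = list(weights), run, correct
                    if correct == len(examples):
                        break
        else:
            # Ordinary perceptron update on a misclassified example.
            weights = [w + target * x for w, x in zip(weights, inputs + [1.0])]
            run = 0
    return pocket


# Toy data: two inputs per example, standing in for two query terms.
examples = [
    ([1.0, 0.0], 1), ([0.9, 0.1], 1),
    ([0.0, 1.0], -1), ([0.1, 0.8], -1),
]
term_weights = pocket_with_ratchet(examples)
```

The returned weights, excluding the final bias entry, would play the role of the per-term weights used in normal query processing.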

3.3.2  Full  Context  Vector  Learning 

This approach is similar to the previous Stem Weight
approach, except we compute an entire query context
vector rather than weights for stems. Here we
use document context vectors directly as inputs to
the network; for example, the i-th network input for
training example n is given by $V^n_i$, the i-th component
of $V^n$. The network weights produced by learning are
directly interpreted as a query context vector.

For  most  topics,  this  approach  was  not  as  good  as 
the  previous  approach  for  two  apparent  reasons.  First 
the  Stem  Weight  approach  makes  use  of  the  original 
query  terms,  a  valuable  piece  of  user  input.  Second, 
the  Stem  Weight  approach  uses  fewer  (5-50)  trainable 
parameters,  possibly  avoiding  a  tendency  with  the 
Full Context Vector approach (≈ 300 parameters) to
'overfit  the  data'. 

3.3.3  Routing  Run  #1:  Best  Candidate 

Our first routing run was to take the best candidate
from 4 sources: the two neural network approaches,
a fully automated query run (as with the first ad hoc
submission), and a top 20 feedback run (as with the
second ad hoc submission). For each topic, the best
method was estimated from retrievals on a separate
test corpus (not the corpus used for submissions).
The winning method's list of 1000 retrievals was then
selected as the retrievals for this topic.

^Only terms in the concept section of the topic were used
for routing experiments.

3.3.4    Routing  Run  #2:  Mix  and  Match 

Our  second  routing  run  was  to  mix  retrievals  from  the 
same  4  sources,  using  a  'quality  estimate'  consisting 
of  11-point  recall/precision  scores  determined  from  a 
run  on  a  separate  test  corpus.  Each  document  in  each 
of  the  four  approaches  was  given  points  proportional 
to  the  run  quality  estimate  and  inversely  proportional 
to  its  position  number  on  an  output  list.  Documents 
appearing  on  more  than  one  list  received  points  for 
each  appearance. 

Mix  and  Match  worked  better  than  the  previous 
Best  Candidate  approach. 
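
One simple reading of the Mix and Match scoring is that a document at position p on a list with quality estimate q earns q / p points. That exact formula is our assumption (the paper says only "proportional" and "inversely proportional"), as are the function name, the alphabetical tie-break, and the quality values in the example.

```python
from collections import defaultdict


def mix_and_match(ranked_lists, quality, top_n=1000):
    """Merge several ranked lists into one.

    ranked_lists -- mapping from run name to its ordered list of document ids
    quality      -- mapping from run name to its quality estimate
                    (e.g. an 11-point average from a separate test corpus)
    """
    points = defaultdict(float)
    for run_name, doc_ids in ranked_lists.items():
        for position, doc_id in enumerate(doc_ids, start=1):
            # Documents on several lists collect points from each appearance.
            points[doc_id] += quality[run_name] / position
    # Highest points first; ties broken by document id for determinism.
    return sorted(points, key=lambda doc_id: (-points[doc_id], doc_id))[:top_n]


merged = mix_and_match(
    {"stem_weight": ["d1", "d2"], "feedback": ["d2", "d3"]},
    {"stem_weight": 0.5, "feedback": 0.2},
)
```

Here d2 appears on both lists and collects 0.25 + 0.2 = 0.45 points, which places it between d1 (0.5) and d3 (0.1).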

4  Comments 

We are generally pleased with the performance of
our one-year-old system. (For final figures,
see the appendix of this proceedings.)

In examining the data, one interesting aspect is
that MatchPlus does better when measured by 11-point
averages than by number of relevant documents
retrieved. This means a comparatively higher percentage
of documents in early retrievals were judged
relevant.

We  are  now  running  initial  experiments  with  word 
sense  disambiguation  using  context  vectors  and  clus- 
tering. It  will  be  interesting  to  see  whether  word  sense 
disambiguation  can  further  improve  retrieval  perfor- 
mance. 
1.  Overview of Latent Semantic Indexing

Latent  Semantic  Indexing  (LSI)  is  an  extension  of  the 
vector  retrieval  method  (e.g.,  Salton  &  McGill,  1983) 
in  which  the  dependencies  between  terms  are 
explicitly  taken  into  account  in  the  representation  and 
exploited in retrieval. This is done by simultaneously
modeling  all  the  interrelationships  among  terms  and 
documents.  We  assume  that  there  is  some  underlying 
or  "latent"  structure  in  the  pattern  of  word  usage 
across  documents,  and  use  statistical  techniques  to 
estimate  this  latent  structure.  A  description  of  terms, 
documents  and  user  queries  based  on  the  underlying, 
"latent  semantic",  structure  (rather  than  surface  level 
word  choice)  is  used  for  representing  and  retrieving 
information.  One  advantage  of  the  LSI  representation 
is  that  a  query  can  be  very  similar  to  a  document  even 
when  they  share  no  words. 

Latent  Semantic  Indexing  (LSI)  uses  singular-value 
decomposition  (SVD),  a  technique  closely  related  to 
eigenvector  decomposition  and  factor  analysis 
(Cullum  and  Willoughby,  1985),  to  model  the 
associative relationships. A large term-document
matrix is decomposed into a set of k, typically 100
to 300, orthogonal factors from which the original
matrix can be approximated by linear combination.
Instead of representing documents and queries directly
as sets of independent words, LSI represents them as
continuous values on each of the k orthogonal
indexing dimensions. Since the number of factors or
dimensions is much smaller than the number of unique
terms,  words  will  not  be  independent.  For  example,  if 
two  terms  are  used  in  similar  contexts  (documents), 
they  will  have  similar  vectors  in  the  reduced- 
dimension  LSI  representation.  The  SVD  technique 
can  capture  such  structure  better  than  simple  term- 
term  or  document-document  correlations  and  clusters. 
LSI  partially  overcomes  some  of  the  deficiencies  of 
assuming  independence  of  words,  and  provides  a  way 
of dealing with synonymy automatically without the
need for a manually constructed thesaurus. LSI is a
completely  automatic  method.  (The  Appendix 
provides  a  brief  overview  of  the  mathematics 
underlying  the  LSI/SVD  method.  Deerwester  et  al., 
1990,  and  Furnas  et  al.,  1988  present  additional 
mathematical  details  and  examples.) 

One can also interpret the analysis performed by SVD
geometrically. The result of the SVD is a vector
representing the location of each term and document
in the k-dimensional LSI representation. The location
of term vectors reflects the correlations in their usage
across documents. In this space the cosine or dot
product between vectors corresponds to their
estimated similarity. Retrieval typically proceeds by
using the terms in a query to identify a point in the
space, and all documents are then ranked by their
similarity to the query. However, since both term and
document vectors are represented in the same space,
similarities between any combination of terms and
documents can be easily obtained.
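
In symbols, the reduced representation and the similarity measure can be summarized as follows. The notation is an assumption made here for illustration (it follows common usage in the LSI literature and may differ from the paper's Appendix): $X$ is the weighted term-document matrix, $T_k$ and $D_k$ hold the first $k$ left and right singular vectors, and $S_k$ is the diagonal matrix of the $k$ largest singular values.

```latex
X \;\approx\; X_k \;=\; T_k \, S_k \, D_k^{\mathsf T},
\qquad
\cos(q, d) \;=\; \frac{q \cdot d}{\lVert q \rVert \, \lVert d \rVert}
```

The rows of $T_k$ and $D_k$ give the $k$-dimensional term and document vectors, and $X_k$ is the best rank-$k$ approximation to $X$ in the least-squares sense. Here $q$ and $d$ stand for any two vectors in the $k$-dimensional space, whether they represent terms, documents, or a query.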

The LSI method has been applied to many of the
standard IR collections with favorable results. Using
the same tokenization and term weightings, the LSI
method has equaled or outperformed standard vector
methods and other variants in almost every case, and
was as much as 30% better in some cases (Deerwester
et al., 1990). As with the standard vector method,
differential term weighting and relevance feedback
both improve LSI performance substantially (Dumais,
1991). LSI has also been applied in experiments on
relevance feedback (Dumais and Schmitt, 1991), and
in filtering applications (Foltz and Dumais, 1992).

The  recent  MatchPlus  system  described  by  Gallant  et 
al.  (1992)  is  related  to  LSI.  Both  systems  model  the 
relationships  between  terms  by  looking  at  the 
similarity  of  the  contexts  in  which  words  are  used,  and 
exploit  these  associations  to  improve  retrieval.  Both 
systems use a reduced dimension vector
representation, but differ in how the term, document
and query vectors are formed.

2.  LSI and TREC-1

We  used  the  TREC-1  conference  as  an  opportunity  to 
"scale  up"  our  tools,  and  to  explore  the  LSI 
dimension-reduction  ideas  using  a  very  rich  corpus  of 
word  usage.  We  were  pleased  that  we  were  able  to 
use  many  of  the  existing  LSI/SVD  tools  on  the  large 
collection.  (See  Dumais,  1993  for  details.)  We  were 
able to compute the SVDs of 50k docs x 75k words
matrices without numerical or convergence problems
on a standard Dec5000 or Sparc10 workstation.
Because of these limits on the size of the matrices we
could handle, we divided the TREC-1 documents into
9 separate subcollections (AP1, DOE1, FR1, WSJ1,
ZIFF1, AP2, FR2, WSJ2, ZIFF2) and worked with
these. There are also some theoretical reasons why
working with subcollections makes sense. By using
topically coherent subcollections one can get better
discrimination among documents. When the ZIFF
subcollection is analyzed separately, for example, 200
or so dimensions can be devoted to differences among
computer-related topics. When these same documents
are part of a large corpus, many fewer indexing
dimensions will be devoted to discriminating among
them.

In  terms  of  accuracy,  LSI  performance  in  TREC-1 
was  about  average.  There  were  some  obvious 
problems with our initial pre-processing of documents
(e.g., "U.S." and "A.T.T." were omitted), and there
were some unanticipated problems in combining
across subcollections. Since many of the top
performing automatic systems used SMART's
preprocessing, we chose to do so as well for TREC-2.
In  addition,  using  the  SMART  software  for  this 
purpose  allows  us  to  compare  LSI  with  the 
comparable  vector  method,  so  that  we  can  examine 
the  contribution  of  LSI  per  se.  We  also  planned  on 
comparing  an  LSI  analysis  using  subcollections  (from 
TREC-1)  with  an  LSI  analysis  of  the  entire  collection 
for  TREC-2. 

We  had  high  hopes  of  being  able  to  build  on  our 
TREC-1  work  for  TREC-2.  In  practice,  however,  the 
changes  in  the  pre-processing  algorithm,  and  the 
decision  to  use  a  single  combined  LSI  analysis 
resulted  in  our  "starting  from  scratch"  in  many 
respects.  We  have  completed  some  experiments  for 
TREC-2,  but  we  did  not  get  as  far  as  we  would  have 
liked,  especially  for  the  adhoc  topics. 


3.  LSI  and  TREC-2 

3.1  Pre-processing 

We  used  the  SMART  system^  for  pre-processing  the 
documents and queries. Some markups (e.g. <>
delimiters)  were  removed,  and  all  hand-indexed 
entries  were  removed  from  the  WSJ  and  ZIFF 
collections. Upper case characters were translated into
lower case, punctuation was removed, and white
spaces were used to delimit terms. The SMART stop
list of 571 words was used as is. The SMART
stemmer (a modified Lovins algorithm) was used
without modification to strip word endings. We did
not use: phrases, proper noun identification, word
sense disambiguation, a thesaurus, syntactic or
semantic parsing, spelling checking or correction,
complex tokenizers, a controlled vocabulary, or any
manual indexing.

The result of this pre-processing can be thought of as a
term-document matrix, in which each cell entry
indicates the frequency with which a term appears in a
document. The entries in the term-document matrix
were then transformed using an "ltc" weighting. The
"ltc" weighting takes the log of individual cell entries,
multiplies each entry for a term (row) by the IDF
weight of the term, and then normalizes the document
(col) length.
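
The three steps of the weighting can be sketched on a toy matrix. The exact formulas are assumptions based on the usual SMART conventions, namely 1 + ln(tf) for the log step and ln(N / df) for the IDF step; the paper itself says only "log", "IDF", and "normalizes". The nested-dictionary layout and the function name are ours.

```python
import math


def ltc_weights(term_doc_counts):
    """Apply log tf, IDF, and cosine length normalization.

    term_doc_counts maps term -> {doc_id: raw frequency (> 0)}.
    Returns doc_id -> {term: weight}, with each document vector of unit length.
    """
    all_docs = {doc for row in term_doc_counts.values() for doc in row}
    n_docs = len(all_docs)
    weighted = {}
    for term, row in term_doc_counts.items():
        idf = math.log(n_docs / len(row))  # len(row) = document frequency
        for doc, tf in row.items():
            weighted.setdefault(doc, {})[term] = (1.0 + math.log(tf)) * idf
    for vec in weighted.values():
        length = math.sqrt(sum(w * w for w in vec.values()))
        if length > 0:
            for term in vec:
                vec[term] /= length
    return weighted


counts = {
    "svd": {"d1": 2, "d2": 1},  # occurs in every document, so its IDF is 0
    "car": {"d1": 1},
    "lsi": {"d2": 3},
}
weights = ltc_weights(counts)
```

In the toy data the term that occurs in every document receives weight 0, which shows how the IDF step discounts terms that do not discriminate among documents.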

We began by processing the 742358 documents from
CD-1 and CD-2. Using the minimal pre-processing
described above, there were 960765 unique tokens,
512251 unique stems, and 81901331 non-zero entries
in the term-document matrix. 742331 documents
contained at least one term. To decrease the matrix to
a size we thought we could handle, we removed
tokens occurring in fewer than 5 documents. This
resulted in 781421 unique tokens, 104533 unique
stemmed words, and 81252681 non-zero entries. We
used the resulting 742331 document x 104533 term
matrix as the starting point for results reported in this
paper. The "ltc" weights were computed on this
matrix.

3.2  SVD  analysis 

The "ltc" matrix described above was used as input to
the SVD algorithm. The SVD program takes the ltc
transformed term-document matrix as input, and
calculates the best "reduced-dimension"
approximation to this matrix. The result of the SVD
analysis is a reduced-dimension vector for each term
and each document, and a vector of the singular
values. The number of dimensions, k, was between
200 and 300 in our experiments. This reduced-dimensional
representation is used for retrieval. The
cosine between term-term, document-document, or
term-document vectors is used as the measure of
similarity between them.

1.  The SMART system (version 11.0) was made available through
the SMART group at Cornell University. Chris Buckley was
especially generous in consultations about how to get the
software to do somewhat non-standard things.

We  were  recently  able  to  compute  the  SVD  analysis 
of  the  full  742k  x  104k  matrix  described  above,  but 
not  in  time  to  include  the  results  in  this  paper.  For  the 
runs submitted, we used a sample of documents from
the  above  matrix.  When  appropriate,  the  documents 
that  were  not  sampled  were  "folded  in"  to  the  resulting 
reduced-dimension  LSI  space.  In  all  cases,  we  used 
the  weights  from  the  742k  x  104k  matrix  (and  did  not 
recompute  them  for  our  samples). 

For  the  routing  experiments,  we  used  the  subset  of 
documents  for  which  we  had  relevance  judgements. 
There were 68385 unique documents with relevance
judgements.  The  SVD  analysis  was  computed  on  the 
relevant  68385  document  x  88112  term  subset  of  the 
above  matrix,  containing  14461782  non-zero  cells.  A 
204  reduced-dimensional  approximation  took  18  hrs 
of  CPU  time  to  compute  on  a  Sparc  10  workstation. 
This  204-dimensional  representation  was  used  for 
matching  and  retrieval. 

For  the  adhoc  experiments,  we  took  a  random  sample 
of  70000  documents.  A  reduced-dimensional  SVD 
approximation  was  computed  on  a  69997  document  x 
82968  term  matrix  (7666044  non-zeros).  The 
resulting  199  reduced-dimensional  representation  was 
used  for  retrieval.  The  672331  documents  not 
included  in  this  sample  were  "folded  in"  to  the  199- 
dimension  LSI  space,  and  the  adhoc  queries  were 
compared  against  all  742k  documents. 

These  "folded  in"  documents  were  located  at  the 
weighted  vector  sum  of  their  constituent  terms.  That 
is,  the  vector  for  a  new  document  was  computed  using 
the  term  vectors  for  all  terms  in  the  document.  For 
documents  that  are  actually  present  in  the  term- 
document  matrix,  this  derived  vector  corresponds 
exactly to the document vector given by the SVD.
New  terms  can  be  added  in  an  analogous  fashion.  The 
vector  for  new  terms  is  computed  using  the  document 
vectors  of  all  documents  in  which  the  term  appears. 
When  adding  documents  and  terms  in  this  manner,  we 
assume that the derived "semantic space" is fixed and
that new items can be fit into it. In general, this is not
the same space that one would obtain if a new SVD
were calculated using both the original and new
documents. In previous experiments, we found that
sampling and scaling 50% of the documents, and
"folding in" the remaining documents resulted in
performance that was indistinguishable from that
observed  when  all  documents  were  scaled.  Here, 
however,  the  scaling  is  based  on  less  than  10%  of  the 
total  corpus. 
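
The "folding in" step can be illustrated with a short sketch. It assumes the k-dimensional term vectors are already available from the SVD and already carry whatever singular-value scaling is needed for the weighted sum to land in the document space; that scaling is not spelled out here, so it is treated as an assumption. The names `fold_in` and `term_vectors`, and the 2-dimensional toy vectors, are hypothetical.

```python
def fold_in(weights, term_vectors, k):
    """Place a new document at the weighted sum of its terms' k-dim vectors.

    weights:      {term: weight} for the new document (e.g. ltc weights)
    term_vectors: {term: list of k floats} taken from the existing LSI space
    """
    vec = [0.0] * k
    for term, w in weights.items():
        tv = term_vectors.get(term)
        if tv is None:
            # Terms that were not in the original analysis have no vector
            # and cannot contribute (assumption of this sketch).
            continue
        for i in range(k):
            vec[i] += w * tv[i]
    return vec

# Toy 2-dimensional space (assumed values, for illustration only).
term_vectors = {"svd": [0.9, 0.1], "matrix": [0.8, 0.3], "crime": [0.1, 0.95]}
new_doc = {"svd": 0.6, "matrix": 0.8, "unseen": 0.5}
doc_vec = fold_in(new_doc, term_vectors, k=2)
```

New terms would be folded in the same way with the roles reversed, summing the vectors of the documents in which the term appears.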

We  also  had  (from  TREC-1)  LSI  analyses  of  the  9 
subcollections in CD-1 and CD-2.

3.3  Queries  and  retrieval 

Queries  were  automatically  processed  in  the  same  way 
as  documents.  For  queries  derived  from  the  topic 
statement,  we  began  with  the  full  text  of  each  topic 
(all  topic  fields),  and  stripped  out  the  SGML  field 
identifiers.  For  feedback  queries,  we  used  the  full  text 
of  relevant  documents.  A  query  vector  (or  new 
document  in  the  case  of  routing)  indicating  the 
frequency  with  which  each  term  appears  in  the  query 
was  automatically  generated  for  each  topic.  The 
query was transformed using SMART's "ltc"
weighting. 

Note  that  we  did  not  use  any  Boolean  connectors  or 
proximity  operators  in  query  formulation.  (The 
implicit  connectives,  as  in  ordinary  vector  methods, 
fall  somewhere  between  ORs  and  ANDs,  but  with  an 
additional  kind  of  "fuzziness"  introduced  by  the 
dimension-reduced  association  matrix  representation 
of terms and documents.)

The  terms  in  the  query  are  used  to  identify  a  vector  in 
the  LSI  space;  recall  that  each  term  has  a  vector 
representation  in  the  space.  A  query  is  simply  located 
at  the  weighted  vector  sum  of  its  constituent  term 
vectors.  The  cosine  between  the  query  vector  and 
every document vector is computed, and documents
are ranked in decreasing order of similarity to the
query. (Although there are many fewer dimensions
than in standard vector retrieval, the entries are almost
all  non-zero  so  inverted  indices  are  not  useful.  This 
means  that  each  query  must  be  compared  to  every 
document.) 
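
A minimal sketch of this exhaustive matching step follows. The document and query vectors are toy 2-dimensional values chosen for illustration (the experiments used roughly 200 dimensions), and the helper names `cosine` and `rank` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def rank(query_vec, doc_vectors):
    """Compare the query to every document; return (score, doc_id) pairs,
    best first. There is no inverted index: every vector is visited."""
    scored = [(cosine(query_vec, vec), doc_id)
              for doc_id, vec in doc_vectors.items()]
    return sorted(scored, reverse=True)

doc_vectors = {"d1": [1.0, 0.0], "d2": [0.7, 0.7], "d3": [0.0, 1.0]}
ranking = rank([1.0, 0.2], doc_vectors)
```

Because the reduced-dimension vectors are dense, the cost of `rank` grows linearly with the number of documents.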

For  200-dimensional  vectors,  about  60,000  cosines 
can  be  computed  per  minute  on  a  Sparc2.  This  time 
includes  both  comparison  time  and  ranking  time,  and 
assumes  that  all  document  vectors  are  pre-loaded  into 
memory.  For  the  adhoc  queries,  the  time  to  compare  a 
query  to  the  743k  documents  was  about  10  minutes  if 
all comparisons were sequential. It is, however,
straightforward to split this matching across several
machines or to use parallel hardware since all
documents are independent. Preliminary experiments
using a 16,000 PE MasPar showed that 60,000 cosines
could be computed and sorted in less than 1 second.
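
Because documents are scored independently, the collection can be divided into shards, each shard scored separately, and the partial rankings merged. The sketch below runs the shards sequentially in one process purely to show that the merged result equals the single-pass result; distributing the shards over machines is left out. It assumes all vectors are pre-normalized to unit length so that a dot product equals the cosine. All names and data are hypothetical.

```python
import heapq
import math

def normalize(v):
    """Scale a vector to unit length so that a dot product equals the cosine."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def top_k_in_shard(query, shard, k):
    """Score one shard of (doc_id, unit_vector) pairs and keep its k best."""
    scored = ((sum(q * d for q, d in zip(query, vec)), doc_id)
              for doc_id, vec in shard)
    return heapq.nlargest(k, scored)

def top_k_sharded(query, shards, k):
    """Merge per-shard results; each shard could be scored on its own machine."""
    partial = [top_k_in_shard(query, shard, k) for shard in shards]
    return heapq.nlargest(k, (hit for hits in partial for hit in hits))

# Toy collection of 20 two-dimensional documents (assumed data).
docs = [("d%d" % i, normalize([1.0, i / 10.0])) for i in range(20)]
query = normalize([1.0, 0.5])
shards = [docs[0:7], docs[7:14], docs[14:20]]
merged = top_k_sharded(query, shards, k=3)
single = top_k_in_shard(query, docs, k=3)
```

Each shard only needs to return its own top k, so the merge step handles at most k times the number of shards candidates.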

It is important to note that all steps in the LSI analysis
are completely automatic and involve no human
intervention. Documents are automatically processed
to derive a term-document matrix. This matrix is
decomposed by the SVD software, and the resulting
reduced-dimension representation is used for retrieval.
While the SVD analysis is somewhat costly in terms
of time for large collections, it need be computed only
once at the beginning to create the reduced-dimension
database. (The SVD takes only about 2 minutes on a
Sparc10 for a 2k x 5k matrix, but this time increases to
about 18 hours for a 60k x 80k matrix.)

3.4  TREC-2:  Routing  experiments 

For  the  routing  queries,  we  created  a  filter  or  profile 
for  each  of  the  50  training  topics.  We  submitted 
results  from  two  sets  of  routing  queries.  In  one  case, 
the  filter  was  based  on  just  the  topic  statements  -  i.e., 
we  treated  the  routing  queries  as  if  they  were  adhoc 
queries.  The  filter  was  located  at  the  vector  sum  of  the 
terms  in  the  topic.  We  call  these  the  routing_topic 
(lsir1) results. This method makes no use of the
training data, representing the topic as if it were an
adhoc query. In the other case, we used information
about relevant documents from the training set. The
filter in this case was derived by taking the vector sum
or centroid of all relevant documents. We call these
the routing_reldocs (lsir2) results. There was an
average of 328 relevant documents per topic, with a
range of 40 to 896. This is a somewhat unusual
variant of relevance feedback; we replace (rather than
combine)  the  original  topic  with  relevant  documents, 
and  we  do  not  downweight  terms  that  appear  in  non- 
relevant  documents.  These  two  extremes  provide 
baselines  against  which  to  compare  other  methods  for 
combining  information  from  the  original  query  and 
feedback  about  relevant  documents.  In  both  cases,  the 
filter  was  a  single  vector.  New  documents  were 
matched  against  the  filter  vector  and  ranked  in 
decreasing  order  of  similarity. 
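
The relevant-document filter can be sketched as below: the filter is the centroid of the known relevant document vectors, and new documents are ranked by cosine against it. The vectors are toy 2-dimensional values and the names are hypothetical; this is an illustration of the idea rather than the code used for the submitted runs.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

# LSI vectors of training documents judged relevant to one topic (assumed).
relevant_docs = [[0.9, 0.1], [0.7, 0.3], [0.8, 0.2]]
routing_filter = centroid(relevant_docs)

# New documents, already folded into the same space (assumed).
new_docs = {"n1": [0.75, 0.25], "n2": [0.1, 0.9]}
ranked = sorted(((cosine(routing_filter, vec), doc_id)
                 for doc_id, vec in new_docs.items()), reverse=True)
```

Since the cosine ignores vector length, the vector sum and the centroid of the relevant documents produce the same ranking.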

The  new  documents  (336306  documents  from  CD-3) 
were  automatically  processed  as  described  in  section 
3.2  above.  It  is  important  to  note  that  only  terms  from 
the CD-1 and CD-2 training collection were used in
indexing these documents. Each new document is
located at the weighted vector sum of its constituent
term vectors in the 204-dimension LSI space (in just
the same way as queries are handled). New
documents were compared to each of the 50 routing
filter vectors using a cosine similarity measure in
204 dimensions. The 1000 best matching documents
for  each  filter  were  submitted  to  NIST  for  evaluation. 

3.4.1  Results 

The main results of the lsir1 and lsir2 runs are shown
in Table 1. The two runs differ only in how the profile
vectors were created - using the weighted average of
the words in the topic statement for lsir1
routing_topic, and using the weighted average of all
relevant documents from the training collection (CD-1
and CD-2) for lsir2 routing_reldocs. Not
surprisingly, the lsir2 profile vectors which take
advantage of the known relevant documents do better
than the lsir1 profile vectors that simply use the topic
statement on all measures of performance. The
improvement in average precision is 31% (.2622 vs.
.3442). Users would get an average of 1 additional
relevant document in the top 10 returned using the
lsir2 method for filtering.


Table 1

               lsir1          lsir2          r1+r2
               (topic wds)    (rel docs)     (sum r1 r2)

Rel_ret        6522           7155           7367
Avg prec       .2622          .3442          .3457
Pr at 100      .3799          .4524          .4394
Pr at 10       .5480          .6660          .6620
R-prec         .3050          .3804          .3786
Q >= Median    27 (4)         40 (9)         42 (6)
Q < Median     23 (0)         10 (0)         8 (0)

Table 1: LSI Routing Results. Comparison of topic
words vs. relevant documents as routing filters.


Compared to other TREC-2 systems, LSI does
reasonably well, especially for the routing_reldocs
(lsir2) run (and the r1+r2 run to be discussed below).
In the case of lsir2, LSI is at or above the median
performance for 40 of the 50 topics, and has the best
score for 9 topics. LSI performs about average for the
routing_topic (lsir1) run even though no information
from the training set was used in forming the routing
vectors in this case (except, of course, for the global
term weights).

We have also performed similar comparisons between
query vectors representing the words in queries and
the centroid of all relevant documents for some of the
standard IR test collections (Med, CISI, Cranfield,
CACM, Time). In these cases, we found an average
improvement  of  107%  when  the  query  was  replaced 
by  the  centroid  of  all  relevant  documents.  The 
improvement  was  67%  when  the  top  three  relevant 
documents  were  used,  and  33%  when  just  the  first 
relevant  document  was  used.  The  smaller  advantages 
observed  in  TREC-2  are  partially  due  to  statistical 
artifacts,  and  partially  to  the  TREC  topics  which  are 
much  richer  need  statements  than  the  usual  IR  queries. 
(We  also  examined  topic  and  reldocs  profiles  in 
TREC-1.  Somewhat  surprisingly,  the  query  using  just 
the  topic  terms  was  about  25%  more  accurate  than  the 
query  using  relevant  documents  from  training.  This  is 
attributable  to  the  small  number  and  inaccuracy  of 
relevance  judgements  in  the  initial  training  set  for 
TREC-1.  This  had  substantial  impact  on  performance 
for  some  topics  because  our  reldocs  queries  were 
based  only  on  the  relevant  articles  and  ignored  the 
original  topic  description.) 

The lsir1 and lsir2 runs provide baselines against
which various combinations of query information and
relevant document information can be measured. We
have tried a simple combination of the lsir1 and lsir2
profile vectors, in which both components have equal
weight. That is, we took the sum of the lsir1 and lsir2
profile vectors for each of the topics and used this as a
profile vector. The results of this analysis are shown
in the third column of the table labeled r1+r2. This
combination does somewhat better than the centroid of
the relevant documents in the total number of relevant
documents returned and in average precision. (We
returned fewer than 1000 documents for 5 of the
topics and not all documents returned by the r1+r2
method had been judged for relevance, so we suspect
that performance could be improved a bit more.) For
27 of the topics, r1+r2 was better than the maximum
of the other two methods. It was never more than
about 10% worse than the best method. Thus it
appears that this combination takes advantage of the
best of both methods.
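
A sketch of the equal-weight combination follows. The text states only that the two profile vectors were summed; whether they were first scaled to a common length is not stated, so this sketch assumes unit-length normalization before summing in order to make "equal weight" literal. The function names and the toy vectors are hypothetical.

```python
import math

def unit(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def combine_profiles(topic_vec, reldocs_vec):
    """Sum the topic-word profile and the relevant-document profile,
    giving each the same length first (an assumption of this sketch)."""
    a, b = unit(topic_vec), unit(reldocs_vec)
    return [x + y for x, y in zip(a, b)]

# Toy profile vectors of very different lengths (assumed data).
combined = combine_profiles([2.0, 0.0], [0.0, 3.0])
```

Without the normalization, the longer of the two vectors would dominate the direction of the sum, and hence the cosine ranking.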

The r1+r2 method which combines a query vector
with a vector representing the centroid of all relevant
documents is a kind of relevance feedback. This is an
unusual variant of relevance feedback since all the
words in relevant documents are used, words in
non-relevant documents are not down-weighted, and query
terms are not re-weighted. Interestingly, this method
appears to produce improvements that are comparable
to those obtained by Buckley, Allan and Salton (1993)
using more traditional relevance feedback methods.
Average precision for the r1+r2 method is 31% better
than for lsir1 which used only the topic words (.3457
vs. .2622), and this is quite similar to the 38%
improvement reported by Buckley, Allan and Salton
(1993) for their richest routing query expansion
method.

The lsir2 method is generally better than the lsir1
method, but there is substantial variability across
topics. The topics on which there are the largest
differences are generally those in which the cosine
between the lsir1 and lsir2 topic vectors is
smallest. The cosines between corresponding topic
vectors range from .87 to .54. The lsir2 method is
substantially better on topics: 71 (incursions by
foreign military or guerrilla groups), 73 (movement of
people from one country to another), 87 (criminal
actions against officers of failed financial institution),
94 (crime perpetrated with the aid of a computer), 98
(production of fiber optics equipment). There are a
few topics for which lsir1 is substantially better than
lsir2: 63 (machine translation system), 65 (information
retrieval system), 85 (actions against corrupt public
officials), 95 (computer application to crime solving).
It is not entirely clear what distinguishes between
these topics, especially topics 94 and 95, for example.

We have not yet had time to look in detail at the
failures of the LSI system. We will examine both
misses and false alarms in more detail. A preliminary
examination of a few topics suggests that lack of
specificity is the main reason for false alarms (highly
ranked but irrelevant documents). This is not
surprising because LSI was designed as a
recall-enhancing method, and we have not added
precision-enhancing tools although it would be easy to do so.

We would also like to examine some query splitting
ideas. We have previously conducted experiments
which suggest that performance can be improved if
the filter is represented as several separate vectors. We
did not use this method for the TREC-2 results we
submitted, but would like to do so. (See also
Kane-Esrig et al., 1991 or Foltz and Dumais, 1992, for a
discussion of multi-point interest profiles in LSI.)

3.5  TREC-2:  Adhoc  experiments 

We submitted two sets of adhoc queries - lsiasm and
lsia1. We had intended to compare the new SMART
pre-processing (lsiasm) and a single LSI space (lsia1)
with our old TREC-1 pre-processing and 9 separate
subcollection spaces. Unfortunately, there were some
serious errors in our translation between internal
document numbers and the <DOCNO> labels
(documents not in the LSI scaling were mislabeled),
so the lsia1 run results are incomplete and misleading.
We have corrected this translation problem, and the
correct results are labeled lsia1*. These results are
summarized in Table 2. We have not yet completed
the comparison against the 9 separate subspaces from
TREC-1.


Table 2

               lsiasm         lsia1          lsia1*
                              (error)        (correct)

Rel_ret        7869           4756           6987
Avg prec       .3018          .1307          .2505
Pr at 100      .4306          .2664          .3922
Pr at 10       .5020          .3340          .5100
R-prec         .3580          .1937          .3069
Q >= Median    37 (2)         16 (1)         25 (1)
Q < Median     13 (0)         34 (7)         25 (0)

Table 2: LSI Adhoc Results. Comparison of
standard vector method with LSI (corrected
version, but missing relevance judgements).

In terms of absolute levels of performance, both
lsiasm and lsia1* are about average. The SMART
results (lsiasm) are somewhat worse than the TREC-2
SMART results reported by Buckley et al., Fuhr et al.,
or Voorhees, but this is because we used slightly
different pre-processing options and did not include
phrases. Although it is generally difficult to compare
across systems, the SMART (lsiasm) and LSI (lsia1*)
runs can meaningfully be compared since both use the
same pre-processing. The starting term-document
matrix was the same in both cases. Much to our
disappointment, the reduced-dimension LSI
performance appears to be somewhat worse than the
comparable SMART vector method. However, it is
important to realize that many of the documents
returned by lsia1* were not judged for relevance
because they were not submitted as an official run.
Table 3 shows the number of documents for which
there are no judgements. Consider the results for just
the top 100 documents for each query (i.e., the
documents judged by the NIST assessors). For lsiasm,
all 5000 documents were judged since this was an
official run, and 2153 were relevant. For lsia1*, only
4073 documents were judged and almost as many,
2122, were relevant. Thus, if only 31 of the 927
unjudged lsia1* documents are relevant, LSI
performance would be comparable to SMART
performance, and if more than 31 were relevant, LSI
performance would be somewhat better. Similarly for
the top 1000 documents, lsia1* had more than 4000
more documents without relevance judgements than
did lsiasm.
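
The break-even arithmetic for the top 100 documents can be made explicit. The counts below are taken from the text; the two derived rates are simple ratios computed from those counts and are not results reported by the runs themselves. Variable names are hypothetical.

```python
# Top-100 counts summed over the 50 adhoc topics (from the text above).
judged_smart, relevant_smart = 5000, 2153
judged_lsi, relevant_lsi = 4073, 2122

# Documents returned by the corrected LSI run that were never judged.
unjudged_lsi = judged_smart - judged_lsi

# Relevant documents LSI would need among the unjudged ones to tie SMART.
needed = relevant_smart - relevant_lsi

# Fraction of unjudged LSI documents that would have to be relevant to tie.
break_even_rate = needed / unjudged_lsi

# For comparison: fraction relevant among the LSI documents that were judged.
observed_rate = relevant_lsi / judged_lsi
```

The break-even rate works out to roughly 3% of the unjudged documents, while a little over half of the judged LSI documents were relevant.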

Table 3

               lsiasm      lsia1*      lsiasm      lsia1*
               top 100     top 100     top 1000    top 1000

relevant       2153        2122        7869        6987
not-relevant   2847        1961        15559       12230
not-judged     0           927         26572       30694

Table 3: Summary of missing relevance
judgements for standard vector method and LSI.

Because  the  missing  relevance  judgements  make 
direct  comparisons  between  SMART  and  LSI  difficult, 
we  decided  to  look  at  performance  for  just  the 
documents  for  which  we  had  relevance  judgements. 
That  is,  we  looked  at  performance  considering  just  the 
38175  unique  documents  for  which  we  have  adhoc 
relevance judgements. These results are shown in
Table  4. 


Table 4

               lsiasm         lsia1*
               (38175)        (38175)

Rel_ret        9493           9596
Avg prec       .3700          .3789
Pr at 100      .4306          .4466
Pr at 10       .5020          .5220
R-prec         .3977          .3995

Table 4: LSI Adhoc Results. Comparison of
standard vector method with LSI using only
documents for which relevance judgements were
available.

The most striking aspect of these results is the higher
overall levels of performance. This is to be expected
since we are only considering the 38175 documents
for which we have relevance judgements, and there
are 700k fewer documents than in the official results.
Considering only this subset of documents, there is a
small advantage for LSI compared to the SMART
vector method. Taken together with the results for
just the top 100 documents, these results suggest that
LSI can outperform a straightforward vector method.
We were somewhat disappointed at the relatively
small difference between LSI and a comparable vector
method in the TREC environment, given that we have
consistently observed larger advantages previously.
The most likely reason for this is that the TREC topics
are much richer and more detailed descriptions of
searchers' needs than are found in typical IR requests.
The average TREC query has 51 unique words in it,
and many of these are very specific names. Since LSI
is primarily a recall-enhancing method, it has little
effect on these already very rich queries. This is much
the same conclusion that groups who tried various
kinds of query expansion (another recall-enhancing
method) reached - e.g., see Voorhees 1993.

We  tried  one  analysis  using  the  new  "summary"  field 
for  each  topic.  This  field  alone  is  used  as  a  much 
shorter  topic  statement  (often  quite  similar  to  the 
description  field)  that  covers  the  relevance 
assessments. These results are summarized in Table
5. As expected, overall performance is lower than
with the complete topics. More interestingly, the
difference between LSI and the standard vector
method is now larger - 16% in average precision. This
is still a somewhat smaller advantage than we have
seen in previous experiments with smaller test
collections, but even the summary queries have an
average of 11 unique words in them.


Table 5

               lsiasm         lsia1*
               (38175)        (38175)

Rel_ret        8043           8676
Avg prec       .2589          .3008
Pr at 100      .3344          .3710
Pr at 10       .3420          .3680
R-prec         .3114          .3386

Table 5: LSI Adhoc Results - Summary Topics.
Comparison of standard vector method with LSI
using only documents for which relevance
judgements were available, using only the
Summary field.

The  reduced  dimension  LSI  vector  retrieval  method 
can  offer  performance  advantages  compared  to  a 
standard  vector  method  for  the  large  TREC  collection. 
The  advantages  are  larger  with  shorter  queries,  as 
expected.  The  exact  nature  of  this  advantage  (e.g., 
which  documents  are  retrieved  by  LSI  but  not  the 
standard  vector  method)  needs  to  be  examined  in  more 
detail. 


4.  Improving  performance 

4.1  Improving  performance  -  Speed 

The  LSI/SVD  system  was  built  as  a  research  prototype 
to  investigate  many  different  information  retrieval  and 
interface  issues.  Retrieval  efficiency  was  not  a  central 
concern  because  we  first  wanted  to  assess  whether  the 
method  worked  before  worrying  about  efficiency,  and 
because the initial applications of LSI involved much
smaller databases of a few thousand documents.
Almost no effort went into re-designing the tools to
work efficiently for the large TREC databases.

4.1.1  SVD 

SVD algorithms get faster all the time. The sparse,
iterative algorithm we now use is about 100 times
faster than the method we used initially (Berry, 1992).
There are the usual speed-memory tradeoffs in the
SVD algorithms, so time can probably be decreased
somewhat by using a different algorithm and more
memory. Parallel algorithms will help a little, but
probably only by a factor of 2. Finally, all
calculations are now done in double precision, so both
time and memory could be decreased by using single
precision computations. Preliminary experiments with
smaller IR test collections suggest that this decrease in
precision will not lead to numerical problems for the
SVD algorithm. It is important to note that the
pre-processing and SVD analyses are one-time-only costs
for relatively stable domains.

4.1.2  Retrieval 

Query vectors are compared to every document. This
process is linear in the number of documents in the
database, and can be quite slow for large databases.
Although there are no practical and efficient
algorithms for finding nearest neighbors in 200- or
300-dimensional spaces, there are several methods
which could be used to speed retrieval. a) Document
clustering could be used to reduce the number of
comparisons, but accuracy would probably suffer
somewhat. We have explored several heuristics for
clustering, but none are particularly effective when
high levels of accuracy are maintained. b) HNC's
MatchPlus system (Gallant et al., 1993) uses another
approach to reduce the number of alternative
documents that must be matched. They use an initial
keyword match to eliminate many documents and
calculate reduced-dimension vector scores for only the
subset of documents meeting the initial keyword
restriction. This may be a reasonable alternative for
long queries (like TREC), but we believe that recall
would be reduced for short queries. In addition, two
data structures need to be maintained. c) Query
matching can also be improved tremendously by
simply using more than one machine or parallel
hardware. Using a 16,000 PE MasPar, with no
attempt to optimize the data storage or sorting, we
decreased the time required to match a 200-dimensional
query vector against all document vectors
and sort by a factor of 60 to 100.
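
The keyword pre-filter idea in b) can be illustrated generically. The sketch below is not HNC's implementation; it is a minimal two-stage matcher under assumed data structures (an inverted index mapping terms to document ids, plus a table of dense document vectors), and every name in it is hypothetical. It also shows the recall concern raised above: a document that shares no keyword with the query is never scored, however close its vector is.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def two_pass(query_terms, query_vec, inverted_index, doc_vectors, k):
    """Stage 1: keep only documents containing at least one query term.
    Stage 2: score the survivors with reduced-dimension cosines."""
    candidates = set()
    for term in query_terms:
        candidates |= inverted_index.get(term, set())
    scored = [(cosine(query_vec, doc_vectors[d]), d) for d in candidates]
    return sorted(scored, reverse=True)[:k]

# Toy data: d3 is closest to the query in vector space but shares no keyword.
inverted_index = {"fiber": {"d1", "d2"}, "optics": {"d1"}}
doc_vectors = {"d1": [0.9, 0.1], "d2": [0.2, 0.8], "d3": [0.95, 0.05]}
hits = two_pass(["fiber", "optics"], [1.0, 0.0],
                inverted_index, doc_vectors, k=2)
```

The two data structures mentioned in the text correspond here to `inverted_index` and `doc_vectors`, both of which would have to be kept up to date.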

4.2  Improving  Performance  -  Accuracy 

We  have  only  begun  to  look  at  a  large  number  of 
parametric  variations  that  might  improve  LSI 
performance.  One  important  variable  for  LSI  retrieval 
is  the  number  of  dimensions  in  the  reduced  dimension 
space.  In  previous  experiments  we  have  found  that 
performance  improves  as  the  number  of  dimensions  is 
increased  up  to  200  or  300  dimensions,  and  decreases 
slowly  after  that  to  the  level  observed  for  the  standard 
vector method (Dumais, 1991). We have examined
TREC-2  performance  using  fewer  dimensions  than 
reported  above  (204  for  the  routing  queries  and  199 
for  the  adhoc  queries)  and  consistently  found  worse 
performance.  Thus,  it  looks  like  we  could  improve 
performance  simply  by  increasing  the  number  of 
dimensions  some.  Unfortunately,  this  requires 
rerunning  the  SVD. 

We  also  noticed  that  many  of  the  adhoc  queries 
contained "NOTs". Since LSI does not use any
Boolean logic and represents a query as the vector
sum of its constituent terms, we thought that removing
this information might help. We modified the topic
statements by hand to remove negated phrases.
Performance improved by less than 2%.

We  still  need  to  experiment  with  different  term 
weighting methods. For the routing and adhoc
experiments we used SMART's "ltc" weighting for
both  the  corpus  of  documents  and  the  queries. 
Buckley  and  Salton's  TREC-1  paper  suggests  that 
alternative  weightings  may  be  more  effective  for  the 
large  TREC  document  collection.  Reweighting  the 
query  vectors  is  easy.  Reweighting  the  document 
collection  is  more  difficult,  because  this  changes  the 
term-document  matrix  and  a  new  SVD  is  required. 

For the routing queries we would like to try several
alternative  methods  of  combining  information  from 
the  original  query  and  the  relevant  documents  to  take 
better  advantage  of  the  good  training  data  that  is 
available.  We  expect  term  re-weighting  and  the  use  of 
negative  information  (e.g.,  down  weighting  terms 
from  non-relevant  documents)  to  improve 
performance  some. 


In  order  to  better  understand  retrieval  performance  we 
have  begun  to  examine  two  kinds  of  retrieval  failures: 
false  alarms,  and  misses.  False  alarms  are  documents 
that  LSI  ranks  highly  that  are  judged  to  be  irrelevant. 
Misses  are  relevant  documents  that  are  not  in  the  top 
1000 returned by LSI.

4.2.1  False  Alarms. 

The  most  common  reason  for  false  alarms  was  lack  of 
specificity.  These  highly  ranked  but  irrelevant  articles 
were  generally  about  the  topic  of  interest  but  did  not 
meet  some  of  the  restrictions  described  in  the  topic 
statement.  Many  topics  required  this  kind  of  detailed 
processing  or  fact-finding  that  the  LSI  system  was  not 
designed  to  address.  Precision  of  LSI  matching  can  be 
increased by many of the standard techniques - proper
noun identification, use of syntactic or statistically-derived
phrases, or a two-pass approach involving a
standard initial global matching followed by a more
detailed analysis of the top few thousand documents.
Buckley and Salton (1992, SMART's global and local
matching), Evans et al. (1992, CLARIT's evoke and
discriminate strategy), Nelson (1992, ConQuest's
global match followed by the use of locality of
information), and Jacobs, Krupka and Rau (1992,
GE's  pre-filter  followed  by  a  variety  of  more  stringent 
tests)  all  used  two-pass  approaches  to  good  advantage 
in  TREC-1  or  TREC-2.  We  would  like  to  try  some  of 
these  methods,  and  will  focus  on  general-purpose, 
completely  automatic  methods  that  do  not  have  to  be 
modified  for  each  new  domain  or  query  restriction. 

Another possible source of false alarms is
inappropriate query pre-processing. The use of negation
is the best example of this problem. 32 of the 50 adhoc
queries contain some negation in the topic statement.
Some preliminary experiments (described briefly above)
found only a small improvement in performance when
negated information was manually removed from the topics.
Another example of inappropriate query processing
involved the use of logical connectives. LSI does not
handle Boolean combinations of words, and often
returned articles covering only a subset of ANDed
topics. Often one aspect of the query appears to
dominate (typically the one described by the terms
with high weights). Limiting the contribution of any
one term to the overall similarity score might alleviate
this problem.

Finally, it is not at all clear why about 20% of the false
alarms were returned by LSI. Since LSI uses a
statistically-derived "semantic" space and not
surface-level word overlap for matching queries to
documents, it is sometimes difficult to understand why
a particular document was returned. One advantage of
the LSI method is that documents can match queries
even when they have no words in common; but this
can also produce some spurious hits. Another reason
for false alarms could be inappropriate word sense
disambiguation. LSI queries are located at the
weighted vector sum of the words, so words are
"disambiguated" to some extent by the other query
words. Similarly, the initial SVD analysis used the
context of other words in articles to determine the
location for each word in the LSI space. However,
since each word has only one location, it sometimes
appears as if it is "in the middle of nowhere". A
related possibility concerns long articles. Lengthy
articles which talk about many distinct subtopics were
averaged into a single document vector, and this can
sometimes produce spurious matches. Breaking larger
documents into smaller subsections and matching on
these might help.

4.2.2  Misses. 

For this analysis we will examine a random subset of
relevant articles that were not in the top 1000 returned
by LSI. Many of the relevant articles were fairly
highly ranked by LSI, but there were also some
notable failures that would be seen only by the most
persistent readers. So far, we have not systematically
distinguished between misses that "almost made it"
and those that were much further down the list.

Most of the misses we examined represent articles
that were primarily about a different topic than the
query, but contained a small section that was relevant
to the query. Because documents are located at the
average of their terms in LSI space, they will
generally be near the dominant theme, and this is a
desirable feature of the LSI representation. Some kind
of local matching should help in identifying less
central themes in documents.

Some misses were also attributable to poor text (and
query) pre-processing and tokenization.

4.3  Open  issues 

On the basis of preliminary failure analyses we would
like to explore some precision-enhancing methods.
We would also like to explore three additional areas.

4.3.1  Separate  vs.  combined  scaling 

We used 9 separate subscalings for the TREC-1
experiments. For TREC-2 we used a single scaling
(based on a very small sample). We have also
recently finished a complete scaling and will compare
this with the subcollection scalings and the sampled
full scaling.

4.3.2  Centroid  query  vs.  many  separate 
points  of  interest 

A single vector was used to represent each query. In
some cases the vector was the average of terms in the
topic statement, and in other cases the vector was the
average of previously identified relevant documents.
A single query vector can be inappropriate if interests
are multifaceted and these facets are not near each
other in the LSI space. We have developed techniques
that allow us to match using a controllable
compromise between averaged and separate vectors
(Kane-Esrig et al., 1991). In the case of the routing
queries, for example, we could match new documents
against each of the previously identified relevant
documents separately rather than against their
average.
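
The difference between the two extremes can be illustrated with a
small sketch. It is not the technique of Kane-Esrig et al. (the
controllable compromise is not reproduced here); it only contrasts
matching against the centroid with matching against each relevant
document separately. The function names and the 2-D vectors are
made up for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def centroid(vectors):
    """Component-wise average of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def centroid_score(doc, relevant):
    """Match a new document against the average of the relevant documents."""
    return cosine(doc, centroid(relevant))

def separate_score(doc, relevant):
    """Match against each relevant document separately; keep the best match."""
    return max(cosine(doc, r) for r in relevant)

# Hypothetical data: two relevant documents covering different facets.
relevant = [[1.0, 0.0], [0.0, 1.0]]
new_doc = [1.0, 0.0]  # addresses the first facet only

score_centroid = centroid_score(new_doc, relevant)
score_separate = separate_score(new_doc, relevant)
```

In this toy case the new document matches one facet exactly, so the
separate score is higher than the centroid score, which is pulled
toward the midpoint of the two facets.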

4.3.3  Interactive  interfaces 

All LSI evaluations were conducted using a
non-interactive system in essentially batch mode. It is well
known that one can have the same underlying retrieval
and matching engine, but achieve very different
retrieval success using different interfaces. We would
like to examine the performance of real users with
interactive interfaces. A number of interface features
could be used to help users make faster (and perhaps
more accurate) relevance judgements, or to help them
explicitly reformulate queries. (See Dumais and
Schmitt, 1991, for some preliminary results on query
reformulation and relevance feedback.) Another
interesting possibility involves returning something
richer than a rank-ordered list of documents to users.
For example, a clustering and graphical display of the
top-k documents might be quite useful. We have done
some preliminary experiments using clustered return
sets, and would like to extend this work to the TREC
collections.

The general idea is to provide people with useful
interactive tools that let them make good use of their
knowledge and skills, rather than attempting to build
all the smarts into the database representation or
matching components of the system.

5. Onward to TREC-3

We were quite pleased that we were able to use many
of the existing LSI/SVD tools on the TREC-1 and
TREC-2 collections. The most important finding in
this regard was that the large, sparse SVD problems
could be computed without numerical or convergence
problems. We modified the preprocessing
substantially for TREC-2, and now have many of the basic
tools in place. We should be able to conduct more
experiments comparing various indexing and query
matching ideas using the same underlying LSI engine.

Bigger SVDs, faster query matching, improving
precision, and interactive interface issues are the
major areas targeted for improvement.

7. Appendix

Latent Semantic Indexing (LSI) uses singular-value
decomposition (SVD), a technique closely related to
eigenvector decomposition and factor analysis
(Cullum and Willoughby, 1985). We take a large
term-document matrix and decompose it into a set of
k, typically 100 to 300, orthogonal factors from which
the original matrix can be approximated by linear
combination.

More formally, any rectangular matrix, $X$, for
example a $t \times d$ matrix of terms and documents, can be
decomposed into the product of three other matrices:

$$X = T_0 S_0 D_0',$$

where $X$ is $t \times d$, $T_0$ is $t \times r$, $S_0$ is $r \times r$,
and $D_0'$ is $r \times d$,
such that $T_0$ and $D_0$ have orthonormal columns, $S_0$ is
diagonal, and $r$ is the rank of $X$. This is the so-called
singular value decomposition of $X$.

If only the $k$ largest singular values of $S_0$ are kept
along with their corresponding columns in the $T_0$ and
$D_0$ matrices, and the rest deleted (yielding matrices $S$,
$T$ and $D$), the resulting matrix, $\hat{X}$, is the unique matrix
of rank $k$ that is closest in the least squares sense to $X$:

$$X \approx \hat{X} = T S D'$$
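
The least-squares statement can be made quantitative. Write
$\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0$ for the singular
values on the diagonal of the full decomposition, and $\hat{X}$ for
the rank-$k$ matrix obtained by keeping only the $k$ largest of them
(the $\sigma_i$ notation is introduced here for illustration and does
not appear in the original). A standard result, the Eckart-Young
theorem, then gives the squared approximation error as the sum of the
squared discarded singular values:

$$\| X - \hat{X} \|_F^2 = \sum_{i=k+1}^{r} \sigma_i^2$$

The approximation is therefore good whenever the discarded singular
values are small, and the minimizer is unique when
$\sigma_k > \sigma_{k+1}$.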

The idea is that the $\hat{X}$ matrix, by containing only the
first $k$ independent linear components of $X$, captures
the major associational structure in the matrix and
throws out noise. It is this reduced model, usually
with $k \approx 100$, that we use to approximate the term to
document association data in $X$. Since the number of
dimensions in the reduced model ($k$) is much smaller
than the number of unique terms ($t$), minor
differences in terminology are ignored. In this
reduced model, the closeness of documents is
determined by the overall pattern of term usage, so
documents can be near each other regardless of the
precise words that are used to describe them, and their
description depends on a kind of consensus of their
term meanings, thus dampening the effects of
polysemy. In particular, this means that documents
which share no words with a user's query may still be
near it if that is consistent with the major patterns of
word usage. We use the term "semantic" indexing to
describe our method because the reduced SVD
representation captures the major associative
relationships between terms and documents.

One can also interpret the analysis performed by SVD
geometrically. The result of the SVD is a
k-dimensional vector representing the location of each
term and document in the k-dimensional
representation. The location of term vectors reflects
the correlations in their usage across documents. In
this space the cosine or dot product between vectors
corresponds to their estimated similarity. Since both
term and document vectors are represented in the
same space, similarities between any combination of
terms and documents can be easily obtained.
Retrieval proceeds by using the terms in a query to
identify a point in the space, and all documents are
then ranked by their similarity to the query. We make
no attempt to interpret the underlying dimensions or
factors, nor to rotate them to some intuitively
meaningful orientation. The analysis does not require
us to be able to describe the factors verbally but
merely to be able to represent terms, documents and
queries in a way that escapes the unreliability,
ambiguity and redundancy of individual terms as
descriptors.
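
The retrieval step described above can be sketched in a few lines.
This is an illustration under stated assumptions, not the LSI
implementation: the 2-D term and document vectors are invented (a real
system would take them from the truncated SVD and may scale them by
the singular values), and the function names are hypothetical. The
query is placed at the weighted vector sum of its terms and documents
are ranked by cosine similarity.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def query_vector(term_vectors, query_weights):
    """Place the query at the weighted vector sum of its terms."""
    k = len(next(iter(term_vectors.values())))
    q = [0.0] * k
    for term, weight in query_weights.items():
        vec = term_vectors.get(term)
        if vec is None:
            continue  # terms absent from the space contribute nothing
        for i in range(k):
            q[i] += weight * vec[i]
    return q

def rank_documents(doc_vectors, q):
    """Return (doc_id, score) pairs, best match first."""
    scored = [(cosine(vec, q), doc_id) for doc_id, vec in doc_vectors.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(doc_id, score) for score, doc_id in scored]

# Invented vectors: "car" and "automobile" lie close together.
term_vectors = {
    "car": [0.9, 0.1],
    "automobile": [0.8, 0.2],
    "flower": [0.1, 0.9],
}
doc_vectors = {"d1": [0.85, 0.15], "d2": [0.05, 0.95]}

ranking = rank_documents(doc_vectors, query_vector(term_vectors, {"car": 1.0}))
```

Because terms and documents share one space, the same `cosine`
function serves for term-term, term-document and document-document
comparisons.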

Choosing the appropriate number of dimensions for
the LSI representation is an open research question.
Ideally, we want a value of k that is large enough to fit
all the real structure in the data, but small enough so
that we do not also fit the sampling error or
unimportant details. If too many dimensions are used,
the method begins to approximate standard vector
methods and loses its power to represent the similarity
between words. If too few dimensions are used, there
is not enough discrimination among similar words and
documents. We find that performance improves as k
increases for a while, and then decreases (Dumais,
1991). That LSI typically works well with a relatively
small (compared to the number of unique terms)
number of dimensions shows that these dimensions
are, in fact, capturing a major portion of the
meaningful structure.

I. INTRODUCTION

For many years, research on information retrieval was
mostly confined to a few relatively small test collections such
as the Cranfield collection [1], the NPL Collection [2], and the
CACM Collection [3]. Over the years, results on those
collections accumulated, with the aim of determining which
technique or combination of techniques resulted in the best
precision/recall figures on those collections. Gradually, a
"standard model" more-or-less emerged: for the test
collections under study, consistently good results are obtained by
vector-model retrieval using a cosine similarity measure, tf.idf
weighting, and a stemming algorithm (e.g. Chapter 9 of [4],
[5]).¹ Out-performing this model on the old test collections
has proved extremely difficult. This has led to a danger of
stagnation in the field of IR, and a feeling that the majority of
what can be learned from precision-recall experiments on the
old collections has been learned.

Fortunately, a number of recent developments have
created new challenges for the field. Among these:

• The Tipster project [6] has led to the construction of a test
collection of unprecedented size [7]. This leads to
challenges related to scaling up IR methods.

• IR methods are now being used on full documents (e.g.
news stories), rather than on abstracts. This leads to new
challenges relating to the structure of large documents,
particularly effects relating to term proximity.

• The on-line database vendors have shown interest in
full-text IR. This leads to challenges relating to the integration
of IR methods with traditional boolean methods which, in
the operational environment, must continue to be
supported if only for the benefit of the existing, entrenched
user community.

• Network-oriented IR protocols have become increasingly
popular as a basic tool for navigating across the network.
This leads to the challenge of designing systems which
give uniformly interpretable results across a distributed
database.

1. Other approaches yield results which are nearly as good and, in
rare cases, better. Approaches based on Bayesian inference networks
are particularly notable in that respect. However, the above model
has come to be accepted as a de-facto standard against which new
methods are measured.

The work reported in this paper represents some steps along
the way to solving these problems. In essence, the challenge
we are facing is to design a system which delivers high
precision-recall figures on large databases of long, complex
documents. Furthermore, the system must be compatible with
existing boolean retrieval methods, and it must be possible to
use this system in a distributed database in a manner which
yields consistently interpretable results. The specific steps
reported at this point consist of the following:

• A database architecture which is suitable for use on large
collections of structured documents, which supports both
base-line IR and boolean retrieval methods.

• An in-memory implementation of this architecture on a
massively parallel computer (the Connection Machine
model CM-5), which is used as a test-bed.

• Precision-Recall figures for this test-bed applied to the
Tipster collection, as part of TREC.

It must be emphasized that this work is in an early stage
and, at this point, has not reached the point where these
methods are demonstrable on extremely large databases.
Furthermore, work relating to taking advantage of document structure
has only barely begun.

II. DATABASE ARCHITECTURE

A.  Choice  of  a  Representation 

There  are  three  basic  approaches  to  representing  databases 
for  retrieval  applications: 

• Text Files. In this method, the database is stored in
full-text form, and scanned sequentially. Such methods are
often implemented by special-purpose hardware [8].

• Signature Files. In this family of methods, a compressed
surrogate for the text file is searched instead of the text file
itself. Overlap encoding is usually employed to construct
the surrogate. There are many variants on signature files
[9].

• Inverted Files. In this family of methods, an index
containing the locations of every word in the database is
constructed. The search is then accomplished by reference to
the index. Again, there are many variants on inverted file
representations [10].

Inverted files have proved to be the only technique which
supports interactive access to very large databases; this is
primarily due to the excessive I/O requirements of the other
methods. Within the family of inverted file methods, several
variants are possible:

•  Simple  Inverted  Files.  In  a  simple  inverted  file,  the  index 
consists  of  a  list  of  documents  in  which  each  word  occurs. 

•  Inverted  Files  with  Weights.  In  an  inverted  file  with 
weights,  the  above  information  is  supplemented  with 
weights,  in  support  of  vector  retrieval  methods. 

• Structured Inverted Files. In a structured inverted file,
the index captures structural information (typically the
paragraph/sentence/word coordinates of each occurrence
of each term), in support of boolean retrieval methods
which rely on proximity operators (e.g. word adjacency).

In  this  research,  we  have  decided  to  explore  the  use  of 
structured  inverted  files  for  the  following  reasons: 

• They support the proximity operations required as part of
boolean retrieval systems. This support makes
architectures based on structured inverted files more likely to be
adopted by the major on-line vendors.

• They provide a fundamentally richer representation of
document structure than is available with the other
methods.

•  They  are  collection-independent  and  retrieval-method 
independent. 

This last point requires some explanation. In a distributed
environment, it would be useful to be able to search text files
at multiple locations as if they were a single text file. Once
weights (which are both collection-dependent and
retrieval-method dependent) are incorporated into the index,
transparent distributed access becomes impossible. The consequence
is that the results of a single query applied to multiple
databases cannot be meaningfully combined, since there is no
way to compare the ranks or scores assigned to the documents
returned.

One of the primary challenges associated with the use of
structured inverted files is their bulk: there are approximately
as many index entries in the inverted file as there are words in
the database, and each entry must represent the document,
paragraph, sentence, and word coordinates for that word. This
problem has been substantially solved by use of a novel
combination of compression techniques [11], which allow
structured indexes having a bulk on the order of 1/3 the size of the
full text to be constructed.

The second challenge associated with the use of structured
inverted files is to implement the standard information
retrieval model using them. The results of this effort are
reported in this paper.

The final challenge associated with structured inverted files
is using them to implement methods which go beyond the
standard model, taking into account the added richness of the
structured representation to improve retrieval system
performance for databases having long, structured documents. This
final challenge remains a topic for future research, and there
are no results to report at this time.

B.  The  Database  Architecture 

The  database  consists  of  the  following  structures: 

• A Compressed Structured Posting File. The first
component of the database is an array of compressed postings.
Each posting gives the location of a word expressed as a
document-paragraph-sentence-word 4-tuple. The postings
are sorted by word ID, in ascending document order.

• A Lexicon/Index. The second component of the database
is a lexicon/index. For each word in the database, it stores
the number of times it occurs plus the location of its
postings in the posting file. The lexicon is structured so that the
information pertaining to a given word may be quickly
located.

• Document Information. The final component of the
database is an array of document-related information. Each
entry contains a mapping from an internal document
identifier (an integer) to an external document identifier (a
string). Additional information, such as the length of the
document or normalization information, may be added as
necessary.

The  following  operations  are  supported: 

•  Adding  Postings.  A  5-tuple  consisting  of  a  word  plus  its 
four  coordinates  may  be  added  to  the  database. 

• Extracting Postings. A word is presented to the database.
An array having the decompressed coordinates of all
occurrences of that word is returned. The array is sorted by
ascending document coordinate, with paragraph, sentence,
and word coordinates used as additional sort keys as
needed.

•  Extracting  Lexicon  Information.  Lexicon  information 
such  as  the  number  of  occurrences  of  a  word  may  be 
extracted. 

•  Setting  Lexicon  Information.  Supplemental  lexicon 
information  (e.g.  part-of-speech  information)  may  be 
added  to  the  lexicon.  This  feature  is  not  used  in  the  work 
reported  herein. 

• Defining a Document. Defines document-specific
information such as external document identifier, document
location on disk, etc., and associates it with a document
coordinate.

• Extracting Document Information. Extracts the
document-specific information mentioned above, given a
document coordinate.

We believe that this architecture suffices to implement
boolean retrieval methods, standard IR methods, and novel
methods making use of document structure, and that it can be
scaled to very large databases and to massively parallel
computers.
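
The structures and operations listed above can be sketched as a small
in-memory class. This is a minimal illustration, not the CM-5
implementation: the class name, method names and the external
identifier used in the example are hypothetical, compression is
omitted, and postings are sorted when extracted rather than stored in
sorted order.

```python
from collections import defaultdict

class StructuredInvertedFile:
    """Toy structured inverted file: posting file, lexicon, document info."""

    def __init__(self):
        # word -> list of (document, paragraph, sentence, word) coordinates
        self._postings = defaultdict(list)
        # word -> supplemental lexicon information (e.g. part of speech)
        self._lexicon_extra = {}
        # internal document id (int) -> document-specific information
        self._documents = {}

    def add_posting(self, word, doc, para, sent, pos):
        """Add a 5-tuple: a word plus its four coordinates."""
        self._postings[word].append((doc, para, sent, pos))

    def extract_postings(self, word):
        """Return all coordinates of a word, sorted by document first,
        then paragraph, sentence and word position."""
        return sorted(self._postings.get(word, []))

    def occurrence_count(self, word):
        """Lexicon information: number of occurrences of a word."""
        return len(self._postings.get(word, []))

    def set_lexicon_info(self, word, key, value):
        """Attach supplemental information to a lexicon entry."""
        self._lexicon_extra.setdefault(word, {})[key] = value

    def define_document(self, doc, external_id, **info):
        """Associate document-specific information with a document coordinate."""
        self._documents[doc] = {"external_id": external_id, **info}

    def document_info(self, doc):
        """Return the information stored for a document coordinate."""
        return self._documents[doc]

# Example with made-up data.
index = StructuredInvertedFile()
index.define_document(0, "DOC-A", length=120)
index.add_posting("retrieval", 1, 0, 0, 3)
index.add_posting("retrieval", 0, 2, 1, 5)
index.add_posting("retrieval", 0, 0, 0, 1)
hits = index.extract_postings("retrieval")
```

Note that no weights are stored: everything in the index is a raw
coordinate or a count, which is what keeps it independent of the
collection and of the retrieval method.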

C.  The  Test-bed 

The above architecture has been implemented in test-bed
form on the Connection Machine model CM-5. The purpose
of this implementation is to permit various retrieval methods
to be evaluated, rather than to support on-line services on very
large files. As such, the focus was on simplicity of
implementation, speed of database construction, and speed of query
execution, rather than on handling files larger than a few
Gigabytes.

The CM-5 [12] is a general-purpose parallel computer
constructed from commodity microprocessors. Each processing
node consists of a SPARC microprocessor, 32 megabytes of
RAM, and a network interface. The CM-5 has two
user-accessible networks: a Data Router, which is used for
point-to-point transmission of data packets, and a Control Network
which is used to implement global operations such as barrier
synchronization, broadcast, and global-maximum. The data
router provides each processor with 5 MB/second (full
duplex) worth of point-to-point bandwidth. The control
network is constructed using a fan-in/fan-out tree so that global
operations complete within a few microseconds of their
initiation.

The CM-5 I/O system consists of a high-performance mass
storage system called the scalable disk array (SDA). In this
system, disk controllers are connected directly to the CM-5's
data router. This allows all processors to have equal access to
all disks in the system, providing the image of a scalable
shared disk environment. The file system implements a UNIX
file system on top of this hardware, such that file systems may
be striped across up to 256 disks [13]. The result is a file
system which can sustain transfer rates exceeding 150 MB/second
in large configurations.

The basic approach taken in this implementation is to begin
by partitioning the document set among the processors. This is
done by having each processor read a fixed-size contiguous
chunk (1 MB) of data from the input file. In general, this will
result in some documents spanning processor boundaries, so
document fragments are then re-distributed. Once this is
done, each processor parses and lexes its own documents,
using conventional programming techniques. The postings are
then inserted into the inverted file.

One novel implementation trick is used in this phase of
processing: rather than sorting the postings, which would be very
time consuming owing to the size of the uncompressed
postings, the database is indexed in two passes. On the first pass,
called the "dry run", the postings are generated, counted, and
discarded. At the end of this pass, the system knows how
much space is required to store the posting list for each word.
Space is then allocated and carved up. The second
"production" phase then begins. The database is scanned, lexed, and
indexed again from scratch, but this time the postings are
compressed and stored into the space allocated at the end of
the dry run. This strategy doubles indexing time but, by
eliminating the expense of sorting the posting file, it ends up both
simplifying the software and reducing overall database
construction time.
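
The two-pass strategy can be sketched on a single processor as
follows. This is an illustration only: the function names are
hypothetical, postings are simplified to (document, position) pairs,
and compression is left out. The first pass counts postings per word,
the counts are turned into offsets in one flat array, and the second
pass writes each posting directly into its slot.

```python
from collections import Counter

def tokenize(documents):
    """Yield (word, doc_id, position) postings in document order."""
    for doc_id, text in enumerate(documents):
        for position, word in enumerate(text.lower().split()):
            yield word, doc_id, position

def build_index(documents):
    # Pass 1 ("dry run"): generate the postings, count them, discard them.
    counts = Counter(word for word, _, _ in tokenize(documents))

    # Allocate one flat array and carve it into one slice per word.
    # Only the vocabulary is sorted here, never the postings themselves.
    offsets, total = {}, 0
    for word in sorted(counts):
        offsets[word] = total
        total += counts[word]
    posting_file = [None] * total

    # Pass 2 ("production"): regenerate the postings and store each one
    # in the next free slot of its word's slice.
    next_free = dict(offsets)
    for word, doc_id, position in tokenize(documents):
        posting_file[next_free[word]] = (doc_id, position)
        next_free[word] += 1

    lexicon = {word: (offsets[word], counts[word]) for word in counts}
    return posting_file, lexicon

def postings_for(word, posting_file, lexicon):
    """Look up a word's slice of the posting file via the lexicon."""
    start, count = lexicon.get(word, (0, 0))
    return posting_file[start:start + count]

docs = ["the cat sat", "the dog sat down"]
posting_file, lexicon = build_index(docs)
sat_postings = postings_for("sat", posting_file, lexicon)
```

Because documents are scanned in the same order in both passes, each
word's slice ends up in ascending document order without any sort of
the postings.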

At the end of this phase, the data structures noted above
(posting file, lexicon/index, and document information) have
been constructed in-memory, and the database is ready for
querying. Using these methods on a 64 processor CM-5, the
TREC-2 training database (2.2 Gigabytes) required 20
minutes to index. The speed of the indexing software permits the
database to be re-indexed for every retrieval experiment,
allowing both indexing methods and query methods to be
conveniently explored. Ignoring stop words, the size of the
compressed inverted file index for the TREC database is 24% of
the raw text. Details of the compression algorithm can be
found in [11].

The first step in query evaluation is to broadcast the query
to all processors. In a boolean system, the query would then
be processed locally by each processor, and the results
produced by the processors would be concatenated to return the
final answer.

In an information retrieval system based on term weighting
and document ranking, slightly more work is required. First,
query-term weights are generally based on some sort of
term-frequency observations. These observations cannot be made
locally, but require the use of some simple global operations.
For example, to determine the number of documents in which a word
occurs, each processor would count document occurrences for
itself, then the global sum operation (supported in hardware
by the Control Network) would be used to produce a
machine-wide count. The second problem arises after the
documents have been scored: an algorithm is required to
extract the highest-ranking documents in the collection.
Several parallel algorithms for this task have been described in
[14]. The test-bed used a variant on the iterative extraction
algorithm: each processor locally determines its
highest-ranking documents, then repeated applications of the global
maximum operation are used to find the best documents in the
collection.
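
Both global steps can be simulated sequentially. The sketch below is
one reading of the description above, not the test-bed's code: each
"processor" is just a Python list or dictionary, `global_sum` and
`global_max` stand in for the Control Network operations, and all
names and data are hypothetical. The particular variant of iterative
extraction used by the test-bed is not specified in the text.

```python
def global_sum(local_values):
    """Stand-in for the hardware global-sum operation."""
    return sum(local_values)

def global_max(local_values):
    """Stand-in for the hardware global-maximum operation."""
    return max(local_values)

def document_frequency(term, shards):
    """Each processor counts its own documents containing the term;
    a global sum produces the machine-wide count."""
    local_counts = [
        sum(1 for words in shard.values() if term in words)
        for shard in shards
    ]
    return global_sum(local_counts)

def top_k(local_scores, k):
    """Iterative extraction: every processor offers its best remaining
    (score, doc_id) pair; the global maximum picks the winner, and the
    owning processor advances to its next candidate."""
    queues = [sorted(scores, reverse=True) for scores in local_scores]
    heads = [0] * len(queues)
    results = []
    for _ in range(k):
        candidates = [
            (queue[head], proc)
            for proc, (queue, head) in enumerate(zip(queues, heads))
            if head < len(queue)
        ]
        if not candidates:
            break  # fewer than k documents in the whole collection
        best, owner = global_max(candidates)
        results.append(best)
        heads[owner] += 1
    return results

# Two simulated processors, each holding part of the collection.
shards = [
    {"d1": {"disk", "array"}, "d2": {"network"}},
    {"d3": {"disk"}},
]
local_scores = [[(0.2, "d2"), (0.9, "d1")], [(0.7, "d3")]]

df_disk = document_frequency("disk", shards)
best_two = top_k(local_scores, 2)
```

Each extraction round costs one global-maximum operation, so
retrieving the k best documents takes k rounds regardless of how the
documents are spread across processors.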

III. THE COSINE SIMILARITY MEASURE

Results of using the classical cosine similarity measure on
the TREC collection have already been reported elsewhere
[15], so those results have not been replicated. This section
will briefly describe how the cosine measure may be
implemented within our architecture.

Cosine similarity measures generally involve constructing
an inverted file incorporating document-term weights. The
structured inverted file architecture does not provide this
information, partly because it is inconsistent with the
structured representation, and partly because of the difficulties
noted above with regard to distributed databases. Except for
the cosine measure's document normalization factor, we find
that it is possible to implement document-term weighting
based only on information in the structured posting file.

In general, term weights (both in queries and documents)
may be computed based on the following factors [5], all of
which are available at run-time in our architecture.

•  Term  Frequency.  This  is  the  number  of  times  a  term 
occurs  in  the  database.  This  information  is  retained  in  the 
lexicon. 

• Document-Term Frequency. This is the number of
documents in which a term occurs. It may be computed at
run-time by traversing the posting list for the term, counting
initial occurrences of the term.

• In-Document Frequency. This is the number of times a
term occurs in a given document. This may be computed at
run-time by traversing the posting list for the term,
counting the number of occurrences of the term having the same
document coordinate.

• Maximum In-Document Frequency. This is the
maximum number of times a given term occurs in any
document. It may be computed directly from the in-document
frequencies.
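
The factors listed above can all be derived in one traversal of a
term's posting list. The following sketch assumes the postings are
(document, paragraph, sentence, word) tuples already sorted by
document coordinate; the function name and the sample data are
hypothetical. In the architecture described here the overall term
frequency would come from the lexicon; the sketch simply counts the
postings.

```python
from itertools import groupby

def frequency_factors(postings):
    """Compute weighting factors for ONE term from its posting list.

    postings: (doc, para, sent, word) tuples sorted by document coordinate.
    """
    # Group consecutive postings that share a document coordinate.
    in_doc = {
        doc: sum(1 for _ in group)
        for doc, group in groupby(postings, key=lambda p: p[0])
    }
    return {
        # total occurrences of the term in the database
        "term_frequency": len(postings),
        # number of distinct documents containing the term
        "document_term_frequency": len(in_doc),
        # occurrences of the term per document
        "in_document_frequency": in_doc,
        # largest per-document count
        "max_in_document_frequency": max(in_doc.values(), default=0),
    }

# Made-up posting list: the term occurs in documents 0, 1 and 4.
postings = [
    (0, 0, 0, 1), (0, 2, 1, 5),
    (1, 0, 0, 3),
    (4, 1, 0, 2), (4, 1, 1, 0), (4, 3, 0, 7),
]
factors = frequency_factors(postings)
```

Because `groupby` only merges adjacent items, the sketch relies on
the posting list being sorted by document, which is the order in
which postings are returned by the database.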

Most conventional weighting schemes (e.g. tf.idf) may be
computed from these run-time factors. The advantage of
doing these computations at run-time is that it eliminates the
need to incorporate collection-specific information into the
database at indexing time. This is important for dynamic
collections as well as for distributed databases, as described
above.

The difficulty with these methods when applied to the
cosine norm is that the total-document-weight term cannot be
conveniently computed at run-time using an inverted file. It
must, therefore, be computed at index time and stored
separately (e.g. in the document-specific information data
structure).
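
The reason is that the normalization factor for a document depends on
the weights of every term in that document, whereas an inverted file
is organized by term. A sketch of the index-time computation follows.
It assumes one common tf.idf variant, weight = tf * log(N / df), which
is an assumption made here for illustration and not necessarily the
weighting used in this work; the function name is also hypothetical.

```python
import math
from collections import Counter

def document_norms(documents):
    """Return the cosine normalization factor of each document:
    the square root of the sum of its squared tf.idf term weights."""
    n_docs = len(documents)
    term_counts = [Counter(text.lower().split()) for text in documents]

    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for counts in term_counts:
        doc_freq.update(counts.keys())

    norms = []
    for counts in term_counts:
        total = 0.0
        for term, tf in counts.items():
            weight = tf * math.log(n_docs / doc_freq[term])  # assumed variant
            total += weight * weight
        norms.append(math.sqrt(total))
    return norms

docs = ["disk array disk", "disk network", "network router"]
norms = document_norms(docs)
```

The resulting list is indexed by internal document identifier, so it
could be kept alongside the other document-specific information. Note
that the values depend on collection-wide counts, which is exactly
the collection dependence discussed above.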

There  are  two  solutions  to  this  difficulty.  The  first  is  to 
insist  on  using  the  cosine  norm,  accepting  the  difficulties  that 
this  implies.  The  second  is  to  look  for  alternatives  to  the 
cosine  norm.  We  believe  that  this  is  the  more  promising 
approach,  but  this  is  in  the  realm  of  work-not-yet-completed. 
In  any  event,  the  value  of  the  cosine  norm  for  large  structured 
documents  has  not  been  established  at  this  time. 

IV.  RETRIEVAL  EXPERIMENTS: 

Most of the time (about 3 to 4 person months) for our
TREC participation was spent on building the test bed.
However, we did finish some runs with automatically constructed
routing and adhoc queries, for which we report the results here.
The experiments were done using the entire collection. We
compare words and phrases, mixed case vs. lower case, and
also explore document length normalization and proximity
measures.


A.  Retrieval  for  Routing  Topics: 

All  queries  were  constructed  and  optimized  automatically. 
The  terms  consisted  of  words  (ignoring  stop  words)  as  well  as 
all  phrases  consisting  of  adjacent  words.  Numbers  were 
ignored  and  case  was  preserved.  No  stemming  or  query 
expansion  with  thesauri  etc.  was  attempted. 

Routing queries were constructed by looking at each word
and (adjacent) phrase from the whole text of the topic
templates and determining a weight based on the number of
relevant documents present in the first 100 documents retrieved
by using just that term. An initial weighted query was
constructed for each topic by the above process. Then each topic
query was optimized by choosing a threshold (per topic) for
the weights and rejecting all weighted query terms below the
threshold. The optimum threshold for each topic was chosen
by straightforward incremental search. Table 1 shows the
results for the routing experiments.

Table 1: Routing Queries

Method                                            Precision at   Average
                                                  100 docs       precision

tmc6-routing-words-phrases                        .3396          .2553
tmc7-routing-words                                .2920          .2045
tmc6-routing-words-phrases-ip                     .3750          .2716
tmc6-routing-words-phrases-doc-length-sent-prox   .3782          .2792
tmc6-routing-words-phrases-ip-sent-prox           .3856          .3344

Routing queries with both weighted words and phrases (queryid
tmc6) did better than queries using just words (queryid tmc7).
Using the same (official) queries, but adding sentence level
proximity (sent-prox), document length scaling (doc-length)
and inverse weights based on which paragraph the term appears
in (ip), seems to improve results (see the next section for
more details about the techniques).
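
The construction of the routing queries can be sketched as follows. This is a toy illustration under stated assumptions rather than the code we ran: the in-memory `search` function, the use of precision at a fixed cutoff as the objective of the threshold search, and all names and data are hypothetical.

```python
def make_search(docs):
    """Return a toy ranked-retrieval function over {doc_id: set_of_terms}."""
    def search(query):
        scores = {}
        for doc_id, terms in docs.items():
            score = sum(w for t, w in query.items() if t in terms)
            if score > 0:
                scores[doc_id] = score
        # Highest score first; ties broken by document id.
        return sorted(scores, key=lambda d: (-scores[d], d))
    return search


def term_weight(term, search, relevant, cutoff=100):
    """Relevant documents among the first `cutoff` retrieved by the term alone."""
    return sum(1 for doc in search({term: 1.0})[:cutoff] if doc in relevant)


def precision_at(ranking, relevant, k):
    return sum(1 for doc in ranking[:k] if doc in relevant) / k


def best_threshold(weights, search, relevant, k):
    """Incremental search: try each integer threshold, keep the best query."""
    best_score, best_t, best_query = float("-inf"), 0, {}
    for threshold in range(0, max(weights.values(), default=0) + 1):
        query = {t: w for t, w in weights.items() if w >= threshold}
        score = precision_at(search(query), relevant, k)
        if score > best_score:
            best_score, best_t, best_query = score, threshold, query
    return best_t, best_query


# Hypothetical training data for one topic.
docs = {
    "d1": {"oil", "spill"}, "d2": {"oil", "price"}, "d3": {"price"},
    "d4": {"spill"}, "d6": {"price"},
}
relevant = {"d1", "d4", "d6"}
search = make_search(docs)
weights = {t: term_weight(t, search, relevant) for t in ("oil", "spill", "price")}
threshold, query = best_threshold(weights, search, relevant, k=2)
```

In this toy collection the low-weight terms pull a non-relevant document into the top ranks, so the search settles on a threshold that keeps only the strongest term.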

B.  Retrieval  for  Adhoc  Topics 

Adhoc queries were automatically constructed by using
words and phrases from different sections of the topic
templates and using tf.idf weights (as derived from the training
collection). The "best" sections for the new topics were
chosen by experimenting with the training topics. Queries derived
from the description-concept sections were used for most of
the experiments. A threshold for weights was used to select
terms for the final queries. Table 2 shows the results for the
adhoc queries.

Table 2: Adhoc Queries

Method                                              Precision at   Average
                                                    100 docs       precision

Case:
tmc8-adhoc-dcwp-idf-caps-wt                         .2002          .1276
tmc8-adhoc-dcwp-idf-lower-wt                        .1734          .1157

Document-length and inverse-para-scaling:
tmc8-adhoc-dcwp-idf-lower-doc-length-wt             .3308          .1904
tmc8-adhoc-dcwp-idf-caps-ip-wt                      .3432          .1939
tmc8-adhoc-dcwp-idf-lower-ip-wt                     .3422          .2027
tmc9-adhoc-etwp-idf-caps-ip-wt                      .3144          .1736

Stemming:
tmc8-adhoc-dcwp-idf-lower-stem-wt                   .1670          .1152
tmc8-adhoc-dcwp-idf-lower-stem-ip-wt                .3240          .1980

Proximity:
tmc8-adhoc-dcwp-idf-caps-doc-length-sent-prox-wt    .3436          .2012
tmc8-adhoc-dcwp-idf-lower-doc-length-sent-prox-wt   .3518          .2146
tmc8-adhoc-dcwp-idf-caps-ip-para-prox-wt            .2892          .1681
tmc8-adhoc-dcwp-idf-lower-ip-para-prox-wt           .3006          .1772
tmc8-adhoc-dcwp-idf-caps-ip-sent-prox-wt            .3476          .1988
tmc8-adhoc-dcwp-idf-lower-ip-sent-prox-wt           .3602          .2164

The query tmc8 (dcwp) consisted of words and phrases
from the description and concept sections of the topic
templates. Query tmc9 (etwp) used words and adjacent phrases
from the entire topic. The acronyms in the run names identify
particular experiments with case (caps and lower), sentence and
paragraph level proximity (sent-prox and para-prox), document
length scaling (doc-length), inverse weights based on
paragraph position (ip) and weight thresholds (wt). The
queries were not changed for the different experiments.

Idf weighted terms from the description and concept
sections taken together (query tmc8) seem to do better than those
derived from the entire topic (query tmc9).

C.  Case 

For the adhoc queries, we compared indexing with and
without preserving case (with similar treatment for the queries).
Except for the simplest experiment with weight thresholds,
converting everything to lower case seems to yield comparable
or better results than preserving case. A similar experiment for
routing queries wasn't attempted because that would have
required reformulating and reoptimizing the routing queries.

D.  Stemming 

Using the Porter Algorithm software for stemming from
[16] we experimented with stemming at index time (and
stemming the queries). We found that stemming reduces
performance when compared with similar experiments using lower
case (the comparison is with lower case since the software we
had used lower case). We are not sure yet why there is such a
decrease.

E.  Document  length  and  Term  position 

Document length scaling was used to explore the effect of
emphasizing shorter documents. A linearly decreasing scaling
for longer documents, with a tail, was used. An inverse weight
based on the paragraph in which the term appears was also
explored. Both the document length scaling and the inverse
paragraph scaling increase performance significantly and seem
to be comparable to each other.
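
The two scalings can be pictured with the following sketch. The breakpoints, the tail value and the 1/n form of the paragraph weight are assumptions chosen for the example; they are not the parameters used in our runs.

```python
def length_scale(doc_length, short=500, long=5000, tail=0.2):
    """Scale factor for a document of `doc_length` words.

    Documents up to `short` words are not penalized; the factor then
    decreases linearly and stays flat at `tail` beyond `long` words.
    All three parameters are hypothetical.
    """
    if doc_length <= short:
        return 1.0
    if doc_length >= long:
        return tail
    fraction = (doc_length - short) / (long - short)
    return 1.0 - fraction * (1.0 - tail)


def inverse_paragraph_weight(paragraph_index):
    """Weight 1 for the first paragraph, 1/2 for the second, and so on."""
    return 1.0 / paragraph_index


def scaled_term_score(weight, doc_length=None, paragraph_index=None):
    """Apply whichever scaling is requested to a term weight."""
    if doc_length is not None:
        weight *= length_scale(doc_length)
    if paragraph_index is not None:
        weight *= inverse_paragraph_weight(paragraph_index)
    return weight
```

Both factors favor material near the start of short documents, which may be one reason the two scalings behave comparably.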

F.  Proximity 

The postings for the inverted file allow use of term position.
Experiments are underway to define proximity scoring
methods that enhance weights for terms appearing close together
(clusters of terms), and can also be implemented efficiently
within the current architecture. We have achieved good results
with sentence level proximity measures based on a bonus
score for the query terms that appear within the same sentence
and within a certain distance of each other. The bonus is also
proportional to the term weight itself. Experiments that used a
bonus independent of term weight dramatically reduced
performance (numbers not reported here), possibly due to noise
introduced by clusters of unimportant terms. Similar
experiments with paragraph level proximity yielded significantly
poorer results as compared to sentence level proximity.
Finally, combining either document length or inverse
paragraph scaling with sentence level proximity improved results.
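
A sentence level proximity bonus of this kind might look like the sketch below. The distance limit, the bonus rate and the data layout are assumptions for illustration only.

```python
def proximity_score(occurrences, weights, max_distance=5, bonus_rate=0.5):
    """Score one document from its query-term occurrences.

    `occurrences` is a list of (term, sentence_index, word_position).
    Each occurrence contributes its term weight; if a different query
    term occurs in the same sentence within `max_distance` words, a
    bonus proportional to the term weight is added.
    """
    score = 0.0
    for term, sentence, position in occurrences:
        weight = weights[term]
        score += weight
        has_neighbor = any(
            other != term and s == sentence and abs(p - position) <= max_distance
            for other, s, p in occurrences
        )
        if has_neighbor:
            score += bonus_rate * weight
    return score


weights = {"oil": 2.0, "spill": 4.0}
occurrences = [("oil", 0, 3), ("spill", 0, 5), ("oil", 2, 10)]
score = proximity_score(occurrences, weights)   # 11.0; 8.0 without the bonus
```

Because the bonus scales with the weight, a cluster of low-weight terms adds little, which is consistent with the poor results we saw for a weight-independent bonus.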


V.  CONCLUSIONS 

We have successfully implemented an in-memory
compressed inverted file text retrieval system on the CM-5
Connection Machine system.

Queries that used both words and phrases composed of
adjacent words did better than those that used words alone.
While our experiments completed so far suggest that
converting everything to lower case for adhoc queries is somewhat
better, it is not clear whether the minor differences
couldn't be removed by further optimization of other
parameters. Experiments that take document length and term position
into account suggest that normalizing for document length and
increasing the weight of terms appearing earlier in the
documents lead to significant improvements for both routing and
adhoc queries. We have also seen that proximity scores based
on nearby terms in a sentence improve retrieval performance.

VI.  FUTURE  WORK 

We would like to compare the effects of cosine
normalization with what we have tried so far and also explore its
interaction with techniques that use term position and proximity
measures.
ABSTRACT

This paper reports on some recent developments in our
natural language text retrieval system. The system uses
advanced natural language processing techniques to
enhance the effectiveness of term-based document
retrieval. The backbone of our system is a traditional
statistical engine which builds inverted index files from
pre-processed documents, and then searches and ranks
the documents in response to user queries. Natural
language processing is used to (1) preprocess the
documents in order to extract content-carrying terms, (2)
discover inter-term dependencies and build a conceptual
hierarchy specific to the database domain, and (3)
process the user's natural language requests into effective
search queries. For the present TREC-2 effort, a total
of 550 MBytes of Wall Street Journal articles (ad-hoc
queries database) and 300 MBytes of San Jose Mercury
articles (routing data) have been processed. In terms of
text quantity this represents approximately 130 million
words of English. Unlike in TREC-1, we were able to
create a single compound index for each database, and
therefore avoid merging of results. While the general
design of the system has not changed since the TREC-1
conference, we nonetheless replaced several components
and added a number of new features which are described
in the present paper.

INTRODUCTION 

A typical information retrieval (IR) task is to select
documents from a database in response to a user's query,
and rank these documents according to relevance. This
has usually been accomplished using statistical methods
(often coupled with manual encoding) that (a) select
terms (words, phrases, and other units) from documents
that are deemed to best represent their content, and (b)
create an inverted index file (or files) that provides
easy access to documents containing these terms. A
subsequent search process will attempt to match a
preprocessed user query (or queries) against term-based
representations of documents, in each case determining a
degree of relevance between the two which depends
upon the number and types of matching terms. Although
many sophisticated search and matching methods are
available, the crucial problem remains that of an
adequate representation of content for both the
documents and the queries.

The simplest word-based representations of
content are usually inadequate since single words are rarely
specific enough for accurate discrimination, and their
grouping is often accidental. A better method is to
identify groups of words that create meaningful phrases,
especially if these phrases denote important concepts in
the database domain. For example, joint venture is an
important term in the Wall Street Journal (WSJ henceforth)
database, while neither joint nor venture is important by
itself. In the retrieval experiments with the training
TREC database, we noticed that both joint and venture
were dropped from the list of terms by the system
because their idf (inverse document frequency) weights
were too low. In large databases, such as TIPSTER, the
use of phrasal terms is not just desirable, it becomes
necessary.

An accurate syntactic analysis is an essential
prerequisite for selection of phrasal terms. Various statistical
methods, e.g., based on word co-occurrences and mutual
information, as well as partial parsing techniques, are
prone to high error rates (sometimes as high as 50%),
turning out many unwanted associations. Therefore a
good, fast parser is necessary, but it is by no means
sufficient. While syntactic phrases are often better
indicators of content than 'statistical phrases' -- where words
are grouped solely on the basis of physical proximity
(e.g., "college junior" is not the same as "junior college")
-- the creation of compound terms makes the term matching
process more complex since, in addition to the usual
problems of synonymy and subsumption, one must deal
with their structure (e.g., "college junior" is the same as
"junior in college"). In order to deal with structure, the
parser's output needs to be "normalized" or "regularized"
so that complex terms with the same or closely related
meanings would indeed receive matching
representations. This goal has been achieved to a certain extent in
the present work. As will be discussed in more detail
below, indexing terms were selected from among
head-modifier pairs extracted from predicate-argument
representations of sentences.

Introduction of compound terms also complicates
the task of discovery of various semantic relationships
among them, including synonymy and subsumption. For
example, the term natural language can be considered, in
certain domains at least, to subsume any term denoting a
specific human language, such as English. Therefore, a
query containing the former may be expected to retrieve
documents containing the latter. The same can be said
about language and English, unless language is in fact a
part of the compound term programming language, in
which case the association language - Fortran is
appropriate. This is a problem because (a) it is a standard
practice to include both simple and compound terms in
document representation, and (b) term associations have
thus far been computed primarily at word level
(including fixed phrases) and therefore care must be taken when
such associations are used in term matching. This may
prove particularly troublesome for systems that attempt
term clustering in order to create "meta-terms" to be used
in document representation.

The system presented here computes term
associations from text at word and fixed phrase level and then
uses these associations in query expansion. A fairly
primitive filter is employed to separate synonymy and
subsumption relationships from others including
antonymy and complementation, some of which are strongly
domain-dependent. This process has led to an increased
retrieval precision in experiments with both ad-hoc and
routing queries for TREC-1 and TREC-2 experiments.
However, the actual improvement levels can vary
substantially between different databases, types of runs
(ad-hoc vs. routing), as well as the degree of prior processing
of the queries. We continue to study more advanced
clustering methods along with the changes in
interpretation of resulting associations, as signaled in the previous
paragraph.

In the remainder of this paper we discuss
particulars of the present system and some of the observations
made while processing TREC-2 data. The above
comments will provide the background for situating our
present effort and the state-of-the-art with respect to
where we should be in the future.


OVERALL  DESIGN 

Our information retrieval system consists of a
traditional statistical backbone (NIST's PRISE system;
Harman and Candela, 1989) augmented with various natural
language processing components that assist the system in
database processing (stemming, indexing, word and
phrase clustering, selectional restrictions), and translate a
user's information request into an effective query. This
design is a careful compromise between purely statistical
non-linguistic approaches and those requiring rather
accomplished (and expensive) semantic analysis of data,
often referred to as 'conceptual retrieval'.

In our system the database text is first processed
with a fast syntactic parser. Subsequently certain types of
phrases are extracted from the parse trees and used as
compound indexing terms in addition to single-word
terms. The extracted phrases are statistically analyzed as
syntactic contexts in order to discover a variety of
similarity links between smaller subphrases and words
occurring in them. A further filtering process maps these
similarity links onto semantic relations (generalization,
specialization, synonymy, etc.) after which they are used to
transform a user's request into a search query.

The user's natural language request is also parsed,
and all indexing terms occurring in it are identified.
Certain highly ambiguous, usually single-word terms may be
dropped, provided that they also occur as elements in
some compound terms. For example, "natural" is deleted
from a query already containing "natural language"
because "natural" occurs in many unrelated contexts:
"natural number", "natural logarithm", "natural
approach", etc. At the same time, other terms may be
added, namely those which are linked to some query
term through admissible similarity relations. For
example, "unlawful activity" is added to a query (TREC topic
055) containing the compound term "illegal activity" via
a synonymy link between "illegal" and "unlawful".
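
The two query transformations just described can be sketched as follows. This is an illustration under assumptions, not the system's code: compound terms are represented as tuples of words, the set of ambiguous words and the table of similarity links are supplied by hand, and only compound terms are expanded.

```python
def refine_query(terms, ambiguous, similar):
    """Drop ambiguous single words covered by a compound, then expand.

    `terms` mixes single words (str) and compound terms (tuples).
    `ambiguous` is a set of words considered highly ambiguous.
    `similar` maps a word to words linked by an admissible relation.
    """
    compound_words = {w for t in terms if isinstance(t, tuple) for w in t}
    kept = [
        t for t in terms
        if not (isinstance(t, str) and t in ambiguous and t in compound_words)
    ]
    added = []
    for term in kept:
        if not isinstance(term, tuple):
            continue
        for i, word in enumerate(term):
            for alternative in similar.get(word, ()):
                variant = term[:i] + (alternative,) + term[i + 1:]
                if variant not in kept and variant not in added:
                    added.append(variant)
    return kept + added


query = ["natural", ("natural", "language"), "illegal", ("illegal", "activity")]
refined = refine_query(
    query,
    ambiguous={"natural"},
    similar={"illegal": ["unlawful"]},
)
```

With the inputs above, "natural" is dropped because it survives inside "natural language", and "unlawful activity" is added through the link between "illegal" and "unlawful".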

One of the striking observations made during the
course of TREC-2 was that removing low-quality
terms from the queries is at least as important (and often
more so) as adding synonyms and specializations. In
some instances (e.g., routing runs) low-quality terms had
to be removed (or inhibited) before similar terms could
be added to the query or else the effect of query
expansion was all but drowned out by the increased noise.¹

After the final query is constructed, the database
search follows, and a ranked list of documents is
returned. It should be noted that all the processing steps,
those performed by the backbone system, and those
performed by the natural language processing components,
are fully automated, and no human intervention or
manual encoding is required.

FAST  PARSING  WITH  TTP  PARSER 

TTP (Tagged Text Parser) is based on the
Linguistic String Grammar developed by Sager (1981). The
parser currently encompasses some 400 grammar
productions, but it is by no means complete. The parser's
output is a regularized parse tree representation of each
sentence, that is, a representation that reflects the
sentence's logical predicate-argument structure. For
example, logical subject and logical object are identified
in both passive and active sentences, and noun phrases
are organized around their head elements. The parser is
equipped with a powerful skip-and-fit recovery
mechanism that allows it to operate effectively in the face of
ill-formed input or under severe time pressure. In the runs
with approximately 130 million words of TREC's Wall
Street Journal and San Jose Mercury texts,² the parser's
speed averaged between 0.3 and 0.5 seconds per
sentence, or up to 70 words per second, on a Sun
SparcStation 2. In addition, TTP has been shown to produce
parse structures which are no worse than those generated
by full-scale linguistic parsers when compared to
hand-coded Treebank parse trees.

¹ We would like to thank Donna Harman for turning our attention
to the importance of term weighting schemes, including term deletion.

TTP is a full grammar parser, and initially, it
attempts to generate a complete analysis for each
sentence. However, unlike an ordinary parser, it has a
built-in timer which regulates the amount of time allowed for
parsing any one sentence. If a parse is not returned
before the allotted time elapses, the parser enters the
skip-and-fit mode in which it will try to "fit" the parse.
While in the skip-and-fit mode, the parser will attempt to
forcibly reduce incomplete constituents, possibly
skipping portions of input in order to restart processing at the
next unattempted constituent. In other words, the parser
will favor reduction over backtracking while in the
skip-and-fit mode. The result of this strategy is an
approximate parse, partially fitted using top-down predictions.
The fragments skipped in the first pass are not thrown
out; instead they are analyzed by a simple phrasal parser
that looks for noun phrases and relative clauses and then
attaches the recovered material to the main parse
structure. Full details of the TTP parser have been described in
the TREC-1 report (Strzalkowski, 1993a), as well as in
other works (Strzalkowski, 1992; Strzalkowski &
Scheyen, 1993).
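
The control flow of a time-limited parse can be pictured with the sketch below. It is a deliberate simplification and an assumption throughout: TTP switches to skip-and-fit inside the same parse and keeps its partial results, whereas here the full parser is simply abandoned and a separate fallback routine is called. The two parser arguments are hypothetical stand-ins.

```python
import time


class ParseTimeout(Exception):
    """Raised by the timer check when the per-sentence budget is spent."""


def parse_with_budget(sentence, full_parse, skip_and_fit, budget_seconds=0.5):
    """Try a full parse; fall back to an approximate one on timeout.

    `full_parse(sentence, check_time)` must call `check_time()`
    periodically so that the timer can interrupt it.
    """
    deadline = time.monotonic() + budget_seconds

    def check_time():
        if time.monotonic() > deadline:
            raise ParseTimeout

    try:
        return full_parse(sentence, check_time), "full"
    except ParseTimeout:
        return skip_and_fit(sentence), "approximate"


# Hypothetical stand-ins for the two parsing modes.
def quick_parser(sentence, check_time):
    check_time()
    return sentence.split()


def stuck_parser(sentence, check_time):
    while True:          # never finishes on its own
        check_time()


def fallback(sentence):
    return ["<fitted>"] + sentence.split()[:1]
```

A sentence that parses quickly returns with the label "full"; one that exhausts its budget comes back from the fallback labeled "approximate", so every sentence yields some structure.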

As may be expected, the skip-and-fit strategy will
only be effective if the input skipping can be performed
with a degree of determinism. This means that most of
the lexical level ambiguity must be removed from the
input text, prior to parsing. We achieve this using a
stochastic parts of speech tagger to preprocess the text (see
TREC-1 report for details). For TREC-2 a number of
problems have been corrected in the tagger, including
improper tokenization of input and handling of
abbreviations.


WORD  SUFFIX  TRIMMER 

Word stemming has been an effective way of
improving document recall since it reduces words to their
common morphological root, thus allowing more
successful matches. On the other hand, stemming tends to
decrease retrieval precision, if care is not taken to
prevent situations where otherwise unrelated words are
reduced to the same stem. In our system we replaced a
traditional morphological stemmer with a conservative
dictionary-assisted suffix trimmer.³ The suffix trimmer
performs essentially two tasks: (1) it reduces inflected
word forms to their root forms as specified in the
dictionary, and (2) it converts nominalized verb forms (e.g.,
"implementation", "storage") to the root forms of
corresponding verbs (i.e., "implement", "store"). This is
accomplished by removing a standard suffix, e.g.,
"stor+age", replacing it with a standard root ending
("+e"), and checking the newly created word against the
dictionary, i.e., we check whether the new root ("store")
is indeed a legal word.
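
The trimming step can be sketched as follows. The rule table and the tiny dictionary are assumptions made for the example; the actual suffix inventory and dictionary are not reproduced here.

```python
def trim_suffix(word, dictionary, rules):
    """Return a dictionary root for `word`, or `word` itself.

    Each rule is (suffix, candidate_root_endings). The suffix is
    removed, each ending is tried in turn, and the result is accepted
    only if it is a legal word in the dictionary.
    """
    for suffix, endings in rules:
        if word.endswith(suffix) and len(word) > len(suffix):
            stem = word[: -len(suffix)]
            for ending in endings:
                candidate = stem + ending
                if candidate in dictionary:
                    return candidate
    return word


# Hypothetical rules and dictionary.
RULES = [("ation", ["", "e"]), ("age", ["", "e"]), ("s", [""])]
DICTIONARY = {"implement", "store", "station", "page", "word"}

roots = [trim_suffix(w, DICTIONARY, RULES)
         for w in ("implementation", "storage", "station", "page", "words")]
```

The dictionary check is what makes the trimmer conservative: "station" and "page" carry the same suffixes as "implementation" and "storage" but are left alone because no legal root results.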


HEAD-MODIFIER  STRUCTURES 

Syntactic phrases extracted from TTP parse trees
are head-modifier pairs. The head in such a pair is a
central element of a phrase (main verb, main noun, etc.),
while the modifier is one of the adjunct arguments of the
head. In the TREC experiments reported here we
extracted head-modifier word and fixed-phrase pairs
only. While TREC databases are large enough to warrant
generation of larger compounds, we were in no position
to verify their effectiveness in indexing. This was largely
because of the tight schedule, but also because of the rapidly
escalating complexity of the indexing process: even with
2-word phrases, compound terms accounted for nearly
88% of all index entries; in other words, including
2-word phrases increased the index size approximately 8
times.

Let  us  consider  a  specific  example  from  the  WSJ 
database: 

The  former  Soviet  president  has  been  a  local  hero 
ever  since  a  Russian  tank  invaded  Wisconsin. 

The tagged sentence is given below, followed by the
regularized parse structure generated by TTP, given in
Figure 1.

The/dt former/jj Soviet/jj president/nn has/vbz
been/vbn a/dt local/jj hero/nn ever/rb since/in a/dt
Russian/jj tank/nn invaded/vbd Wisconsin/np ./per


² Approximately 0.85 GBytes of text, over 6 million sentences.

³ Dealing with prefixes is a more complicated matter, since they
may have quite strong effect upon the meaning of the resulting term,
e.g., un- usually introduces explicit negation.

[assert
  [[perf [HAVE]]
    [[verb [BE]]
      [subject
        [np
          [n PRESIDENT]
          [t_pos THE]
          [adj [FORMER]]
          [adj [SOVIET]]]]
      [object
        [np
          [n HERO]
          [t_pos A]
          [adj [LOCAL]]]]
      [adv EVER]
      [sub_ord
        [SINCE
          [[verb [INVADE]]
            [subject
              [np
                [n TANK]
                [t_pos A]
                [adj [RUSSIAN]]]]
            [object
              [np
                [name [WISCONSIN]]]]]]]]]]

Figure 1. Predicate-argument parse structure.

It should be noted that the parser's output is a
predicate-argument structure centered around main
elements of various phrases. In Figure 1, BE is the main
predicate (modified by HAVE) with 2 arguments
(subject, object) and 2 adjuncts (adv, sub_ord). INVADE is
the predicate in the subordinate clause with 2 arguments
(subject, object). The subject of BE is a noun phrase
with PRESIDENT as the head element, two modifiers
(FORMER, SOVIET) and a determiner (THE). From this
structure, we extract head-modifier pairs that become
candidates for compound terms. The following types of
pairs are considered: (1) a head noun and its left
adjective or noun adjunct, (2) a head noun and the head of its
right adjunct, (3) the main verb of a clause and the head
of its object phrase, and (4) the head of the subject
phrase and the main verb. These types of pairs account
for most of the syntactic variants for relating two words
(or simple phrases) into pairs carrying compatible
semantic content. For example, the pair
retrieve+information will be extracted from any of the
following fragments: information retrieval system;
retrieval of information from databases; and information
that can be retrieved by a user-controlled interactive
search process. In the example at hand, the following
head-modifier pairs are extracted (pairs containing
low-content elements, such as BE and FORMER, or names,
such as WISCONSIN, will be later discarded):

PRESIDENT+BE, PRESIDENT+FORMER, PRESIDENT+SOVIET,
BE+HERO, HERO+LOCAL,
TANK+INVADE, TANK+RUSSIAN, INVADE+WISCONSIN

We may note that the three-word phrase former Soviet
president has been broken into two pairs, former
president and Soviet president, both of which denote
things that are potentially quite different from what the
original phrase refers to, and this fact may have a
potentially negative effect on retrieval precision. This is one
place where a longer phrase appears more appropriate.
The representation of this sentence may therefore contain
the following terms (along with their inverse document
frequency weights):

PRESIDENT             2.623519
SOVIET                5.416102
PRESIDENT+SOVIET     11.556747
PRESIDENT+FORMER     14.594883
HERO                  7.896426
HERO+LOCAL           14.314775
INVADE                8.435012
TANK                  6.848128
TANK+INVADE          17.402237
TANK+RUSSIAN         16.030809
RUSSIAN               7.383342
WISCONSIN             7.785689
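
The extraction of pairs from a predicate-argument structure can be sketched as follows. The nested-dictionary encoding of the parse is an assumption made for the example (TTP produces the bracketed form of Figure 1), and only pair types (1), (3) and (4) are covered; right adjuncts and the later filtering of low-content elements are left out.

```python
def np_pairs(np):
    """Type (1): head noun paired with each of its left adjuncts."""
    return [(np["head"], modifier) for modifier in np.get("adj", [])]


def clause_pairs(clause):
    """Collect head-modifier pairs from a clause and its subordinates."""
    pairs = []
    subject, obj = clause.get("subject"), clause.get("object")
    if subject:
        pairs.append((subject["head"], clause["verb"]))   # type (4)
        pairs.extend(np_pairs(subject))
    if obj:
        pairs.append((clause["verb"], obj["head"]))       # type (3)
        pairs.extend(np_pairs(obj))
    for subordinate in clause.get("sub_ord", []):
        pairs.extend(clause_pairs(subordinate))
    return pairs


# Hand-built encoding of the structure in Figure 1.
sentence = {
    "verb": "BE",
    "subject": {"head": "PRESIDENT", "adj": ["FORMER", "SOVIET"]},
    "object": {"head": "HERO", "adj": ["LOCAL"]},
    "sub_ord": [{
        "verb": "INVADE",
        "subject": {"head": "TANK", "adj": ["RUSSIAN"]},
        "object": {"head": "WISCONSIN"},
    }],
}
terms = ["+".join(pair) for pair in clause_pairs(sentence)]
```

For the example sentence this yields the same eight pairs listed in the text, before low-content elements and names are discarded.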

While generating compound terms we took care to
identify 'negative' terms, that is, those whose denotations
have been explicitly excluded by negation. Even though
matching of negative terms was not used in retrieval (nor
did we use negative weights), we could easily prevent
matching a negative term in a query against its positive
counterpart in the database by removing known negative
terms from queries. As an example consider the
following fragment from topic 067:

It should NOT be about economically-motivated
civil disturbances and NOT be about a civil
disturbance directed against a second country.

The corresponding compound terms are:

NOT disturb+civil
NOT country+second
NOT direct+disturb

The particular way of interpreting syntactic
contexts was dictated, to some degree at least, by statistical
considerations. Our initial experiments were performed
on a relatively small collection (CACM-3204), and
therefore we combined pairs obtained from different syntactic
relations (e.g., verb-object, subject-verb, noun-adjunct,
etc.) in order to increase frequencies of some
associations. This became largely unnecessary in a large
collection such as TIPSTER, but we had no means to test
alternative options, and thus decided to stay with the original.
It should not be difficult to see that this was a
compromise solution, since many important distinctions were
potentially lost, and strong associations could be
produced where there weren't any. A way to improve things
is to consider different syntactic relations independently,
perhaps as independent sources of evidence that could
lend support (or not) to certain term similarity
predictions. We started investigating this option during
TREC-2; however, it has not been sufficiently tested yet.

One difficulty in obtaining head-modifier pairs of the highest
accuracy is the notorious ambiguity of nominal compounds. For
example, the phrase natural language processing should generate
language+natural and processing+language, while dynamic
information processing is expected to yield processing+dynamic
and processing+information. Yet another case is executive vice
president, where the association president+executive may be
stretching things a bit too far. Since our parser has no
knowledge about the text domain, and uses no semantic
preferences, it does not attempt to guess any internal
associations within such phrases. Instead, this task is passed
to the pair extractor module, which processes ambiguous parse
structures in two phases. In phase one, all and only
unambiguous head-modifier pairs are extracted, and the
frequencies of their occurrences are recorded. In phase two,
frequency information about pairs generated in the first pass
is used to form associations from ambiguous structures. For
example, if language+natural has occurred unambiguously a
number of times in contexts such as parser for natural
language, while processing+natural has occurred significantly
fewer times or perhaps not at all, then we will prefer the
former association as valid. In TREC-2, phrase disambiguation
was not used; instead, we decided to avoid ambiguous phrases
altogether. While our disambiguation program worked generally
satisfactorily, we could resolve only a small fraction of cases
(about 7%), and thus its impact on the overall system's
performance was limited. However, query-level disambiguation
may be more important.
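
The two-phase procedure described above can be illustrated with
a minimal sketch. It is not the actual pair extractor: the
function name, the data layout (a counter keyed by
(head, modifier) pairs collected in phase one), and the
min_ratio margin are all assumptions made for illustration,
since the text does not say how large the frequency gap must be
before an association is preferred.

```python
from collections import Counter

def choose_attachment(modifier, candidate_heads, unambiguous_counts, min_ratio=2.0):
    """Phase two: pick a head for `modifier` using phase-one counts.

    unambiguous_counts maps (head, modifier) -> frequency observed in
    unambiguous contexts. min_ratio is a hypothetical margin: the best
    candidate must be at least this many times more frequent than the
    runner-up. Returns a (head, modifier) pair, or None if the evidence
    is too weak and the phrase stays unresolved.
    """
    scored = sorted(
        ((unambiguous_counts[(head, modifier)], head) for head in candidate_heads),
        reverse=True,
    )
    best_count, best_head = scored[0]
    runner_up = scored[1][0] if len(scored) > 1 else 0
    if best_count == 0 or best_count < min_ratio * runner_up:
        return None
    return (best_head, modifier)

# Phase one (assumed already done): counts of unambiguous pairs.
phase_one = Counter({("language", "natural"): 12, ("processing", "natural"): 1})

# "natural language processing": does "natural" modify "language" or "processing"?
resolved = choose_attachment("natural", ["language", "processing"], phase_one)
```

With these illustrative counts the sketch prefers
language+natural; when the counts are close, or both are zero,
it returns None, which corresponds to leaving the phrase
unresolved.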


TERM  CORRELATIONS  FROM  TEXT 

Head-modifier pairs form compound terms used in database
indexing. They also serve as occurrence contexts for smaller
terms, including single-word terms. If two terms tend to be
modified with a number of common modifiers and otherwise appear
in few distinct contexts, we assign them a similarity
coefficient, a real number between 0 and 1. The similarity is
determined by comparing distribution characteristics for both
terms within the corpus: how much information content do they
carry, does their information contribution over contexts vary
greatly, are the common contexts in which these terms occur
specific enough? In general we will credit high-content terms
appearing in identical contexts, especially if these contexts
are not too commonplace.⁴

For TREC-2 runs we used a similarity formula which, unlike the
similarity formula used in TREC-1, produces clusters of related
words and phrases, but will not generate uniform term
similarity ranking across clusters. This new formula, however,
appeared better suited to handle the diverse subject matter of
the WSJ database. We used a (revised) variant of the weighted
Tanimoto measure described in (Grefenstette, 1992):

\[
SIM(x, y) = \frac{\sum_{att} MIN\bigl(W([x,att]),\, W([y,att])\bigr)}
                 {\sum_{att} MAX\bigl(W([x,att]),\, W([y,att])\bigr)}
\]

with

\[
W([x,y]) = GEW(x) * \log(f_{x,y})
\]

\[
GEW(x) = 1 + \sum_{y} \left( \frac{f_{x,y}}{n_y} *
         \frac{\log\left(\frac{f_{x,y}}{n_y}\right)}{\log(N)} \right)
\]

In the above, f_{x,y} stands for the absolute frequency of pair
[x,y], n_y is the frequency of term y, and N is the number of
single-word terms. Sample clusters obtained from an approx.
250 MByte (42 million words) subset of WSJ (years 1990-1992)
are given in Table 1.
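
As a hedged illustration of the measure above, the following
sketch computes GEW, W, and SIM directly from the three
definitions. The function names and the dictionary layout
(pair_freq keyed by (x, att), term_freq holding n_y) are
assumptions, and the term frequencies and N in the example are
invented for the demonstration; only the four pair frequencies
for firm and industry come from the aerospace/pharmaceutical
example in this paper.

```python
import math

def gew(x, pair_freq, term_freq, n_terms):
    """GEW(x) = 1 + sum_y (f_xy / n_y) * log(f_xy / n_y) / log(N)."""
    total = 0.0
    for (head, y), f in pair_freq.items():
        if head == x:
            p = f / term_freq[y]
            total += p * math.log(p)
    return 1.0 + total / math.log(n_terms)

def weight(x, att, pair_freq, term_freq, n_terms):
    """W([x,att]) = GEW(x) * log(f_{x,att}); 0 if the pair never occurred."""
    f = pair_freq.get((x, att), 0)
    if f == 0:
        return 0.0
    return gew(x, pair_freq, term_freq, n_terms) * math.log(f)

def sim(x, y, pair_freq, term_freq, n_terms):
    """Weighted Tanimoto: sum of MIN weights over sum of MAX weights."""
    atts = {att for (head, att) in pair_freq if head in (x, y)}
    num = den = 0.0
    for att in atts:
        wx = weight(x, att, pair_freq, term_freq, n_terms)
        wy = weight(y, att, pair_freq, term_freq, n_terms)
        num += min(wx, wy)
        den += max(wx, wy)
    return num / den if den > 0 else 0.0

# Pair frequencies for firm/industry are taken from the text;
# term frequencies, the banana pair, and N are made up.
pair_freq = {
    ("aerospace", "firm"): 9, ("pharmaceutical", "firm"): 22,
    ("aerospace", "industry"): 84, ("pharmaceutical", "industry"): 56,
    ("banana", "republic"): 4,
}
term_freq = {"firm": 1000, "industry": 2000, "republic": 300}
N = 1000

s_close = sim("aerospace", "pharmaceutical", pair_freq, term_freq, N)
s_none = sim("aerospace", "banana", pair_freq, term_freq, N)
```

Terms that share no context receive a similarity of 0, and a
term compared with itself receives 1, which matches the stated
range of the coefficient.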

In order to generate better similarities, we require that words
x1 and x2 appear in at least M distinct common contexts, where
a common context is a couple of pairs [x1,y] and [x2,y], or
[y,x1] and [y,x2], such that they each occurred at least three
times. Thus, banana and Baltic will not be considered for a
similarity relation on the basis of their occurrences in the
common context of republic, no matter how frequent, unless
there is another such common context comparably frequent (there
wasn't any in TREC's WSJ database). For smaller or
narrow-domain databases M=2 is usually sufficient. For large
databases covering a rather diverse subject matter, like WSJ or
SJMN (San Jose Mercury News), we used M≥5.⁵ This, however,
turned out not to be sufficient. We would still generate fairly
strong similarity links between terms such as aerospace and
pharmaceutical, where 6 or more common contexts were found. In
the example at hand the following common contexts were
located,⁶ all occurring at the head (left) position of a pair
(at right are their global entropy weights and frequencies with
aerospace (x1) and pharmaceutical (x2), respectively):

    context     GEW     f_{x1,y}   f_{x2,y}
    firm        0.58        9         22
    industry    0.51       84         56
    sector      0.61        5          9
    concern     0.50      130        115
    analyst     0.62       23          8
    division    0.53       36         28
    giant       0.62       15         12


⁴ It would not be appropriate to predict similarity between
language and logarithm on the basis of their co-occurrence with
natural.

⁵ For example, banana and Dominican were found to have two
common contexts: republic and plant, although the latter
occurred in apparently different senses in Dominican plant and
banana plant.


Note that while some of these weights are quite low (less than
0.6; GEW takes values between 0 and 1), thus indicating a
low-importance context, the frequencies with which these
contexts occurred with both terms were high and balanced on
both sides (e.g., concern), thus adding to the strength of
association. We are now considering additional thresholds to
bar low-importance contexts from being used in similarity
calculation.

It may be worth pointing out that the similarities are
calculated using term co-occurrences in syntactic rather than
in document-size contexts, the latter being the usual practice
in non-linguistic clustering (e.g., Sparck Jones and Barber,
1971; Crouch, 1988; Lewis and Croft, 1990). Although the two
methods of term clustering may be considered mutually
complementary in certain situations, we believe that more and
stronger associations can be obtained through syntactic-context
clustering, given a sufficient amount of data and a reasonably
accurate syntactic parser.⁷


QUERY  EXPANSION 

Similarity relations are used to expand user queries with new
terms, in an attempt to make the final search query more
comprehensive (adding synonyms) and/or more pointed (adding
specializations).⁸ It follows that not all similarity relations
will be equally useful in query expansion. For instance,
complementary and antonymous relations like the one between
Australian and Canadian, or accept and reject, or even
generalizations like from aerospace to industry, may actually
harm the system's performance, since we may end up retrieving
many irrelevant documents. On the other hand, database search
is likely to miss relevant documents if we overlook the fact
that vice director can also be deputy director, or that
takeover can also be merge, buy-out, or acquisition. We noted
that an average set of similarities generated from a text
corpus contains about as many "good" relations (synonymy,
specialization) as "bad" relations (antonymy, complementation,
generalization), as seen from the query expansion viewpoint.
Therefore any attempt to separate these two classes and to
increase the proportion of "good" relations should result in
improved retrieval. This has indeed been confirmed in our
experiments, where a relatively crude filter has visibly
increased retrieval precision.


⁶ Other common contexts, such as company or market, have
already been rejected because they were paired with too many
different words (a high dispersion ratio, see note 12).

⁷ Non-syntactic contexts cross sentence boundaries with no
fuss, which is helpful with short, succinct documents (such as
CACM abstracts), but less so with longer texts; see also
(Grishman et al., 1986).

⁸ Query expansion (in the sense considered here, though not
quite in the same way) has been used in information retrieval
research before (e.g., Sparck Jones and Tait, 1984; Harman,
1988), usually with mixed results. An alternative is to use
term clusters to create new terms, "metaterms", and use them to
index the database instead (e.g., Crouch, 1988; Lewis and
Croft, 1990). We found that the query expansion approach gives
the system more flexibility, for instance, by making room for
hypertext-style topic exploration via user feedback.

In order to create an appropriate filter, we devised a global
term specificity measure (GTS) which is calculated for each
term across all contexts in which it occurs. The general
philosophy here is that a more specific word/phrase would have
a more limited use, i.e., a more specific term would appear in
fewer distinct contexts. In this respect, GTS is similar to the
standard inverted document frequency (idf) measure, except that
term frequency is measured over syntactic units rather than
document-size units.⁹ Terms with higher GTS values are
generally considered more specific, but the specificity
comparison is only meaningful for terms which are already known
to be similar. The new function is calculated according to the
following formula:


\[
GTS(w) = \begin{cases}
  IC_L(w) * IC_R(w) & \text{if both exist} \\
  IC_R(w)           & \text{if only } IC_R(w) \text{ exists} \\
  IC_L(w)           & \text{otherwise}
\end{cases}
\]

where (with n_w, d_w > 0):

\[
IC_L(w) = IC([w,\_]) = \frac{n_w}{d_w (n_w + d_w - 1)}
\]

\[
IC_R(w) = IC([\_,w]) = \frac{n_w}{d_w (n_w + d_w - 1)}
\]

For any two terms w1 and w2, and a constant δ > 1, if
GTS(w2) ≥ δ * GTS(w1) then w2 is considered more specific than
w1. In addition, if SIM_norm(w1,w2) = σ > θ, where θ is an
empirically established threshold, then w2 can be added to the
query containing term w1 with weight σ.¹⁰ For example, the
following were obtained from the WSJ training database:

    GTS(takeover) = 0.00145576
    GTS(merge)    = 0.00094518
    GTS(buy-out)  = 0.00272580
    GTS(acquire)  = 0.00057906

with

    SIM(takeover, merge)    = 0.190444
    SIM(takeover, buy-out)  = 0.157410
    SIM(takeover, acquire)  = 0.139497
    SIM(merge, buy-out)     = 0.133800
    SIM(merge, acquire)     = 0.263772
    SIM(buy-out, acquire)   = 0.109106

Therefore both takeover and buy-out can be used to specialize
merge or acquire. With this filter, the relationships between
takeover and buy-out and between merge and acquire are either
both discarded or accepted as synonymous. At this time we are
unable to tell synonymous or near-synonymous relationships from
those which are primarily complementary, e.g., man and woman.


⁹ We believe that measuring term specificity over document-size
contexts (e.g., Sparck Jones, 1972) may not be appropriate in
this case. In particular, syntax-based contexts allow for
processing texts without any internal document structure.

¹⁰ For TREC-2 we used θ = 0.2; δ varied between 10 and 100.
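
The filter can be sketched as follows, using the GTS and SIM
values listed above. This is only an illustration under stated
assumptions: the function name and data layout are hypothetical,
the raw SIM values stand in for the normalized similarity, and
the delta and theta settings used in the example were picked to
make the behaviour visible on four terms. They are not the
TREC-2 settings.

```python
# GTS and SIM values as reported for the WSJ training database.
GTS = {
    "takeover": 0.00145576,
    "merge": 0.00094518,
    "buy-out": 0.00272580,
    "acquire": 0.00057906,
}
SIM = {
    ("takeover", "merge"): 0.190444,
    ("takeover", "buy-out"): 0.157410,
    ("takeover", "acquire"): 0.139497,
    ("merge", "buy-out"): 0.133800,
    ("merge", "acquire"): 0.263772,
    ("buy-out", "acquire"): 0.109106,
}

def expansion_candidates(term, gts, sim, delta, theta):
    """Return {w2: weight} for terms w2 that may be added to a query with `term`.

    w2 qualifies when it is similar enough (sim > theta) and more specific
    (GTS(w2) >= delta * GTS(term)); its weight is the similarity value.
    """
    candidates = {}
    for (a, b), s in sim.items():
        if term == a:
            other = b
        elif term == b:
            other = a
        else:
            continue
        if s > theta and gts[other] >= delta * gts[term]:
            candidates[other] = s
    return candidates

# Illustrative thresholds only (hypothetical, chosen for this tiny example).
for_merge = expansion_candidates("merge", GTS, SIM, delta=2.0, theta=0.1)
for_acquire = expansion_candidates("acquire", GTS, SIM, delta=2.0, theta=0.1)
```

Raising delta makes the specificity test stricter, while
raising theta discards weaker similarity links before
specificity is considered at all.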

In TREC-1 the impact of query expansion through term
similarities on the system's overall performance was generally
disappointing. For TREC-2 we have made a number of changes to
the term correlation model, but again time limitations
prevented us from properly testing all options. Among the most
important changes are:

(1)  Exclusion of pairs obtained from SUBJECT-VERB relations:
     we determined that these contexts are generally of little
     use, as neither subject nor verb subcategorizes well for
     the other. Moreover, we observed that the presence of
     these pairs was the source of many unwanted term
     associations.¹¹

(2)  Automatic pruning of low-content terms from the queries:
     terms with low idf weights, and terms with low information
     contribution weights¹² that are elements of compound
     terms, are removed from queries before database search. As
     we tuned various cutoff thresholds we noted that a
     significant increase in both recall and precision could be
     obtained.


¹¹ Subject-Verb pairs were retained as compound terms, however.

¹² The Information Contribution measure indicates the strength
of word pairings, and is defined as
IC(x,[x,y]) = f_{x,y} / (n_x + d_x - 1), where f_{x,y} is the
absolute frequency of pair [x,y] in the corpus, n_x is the
frequency of term x at the head position, and d_x is a
dispersion parameter understood as the number of distinct
syntactic contexts in which term x is found.


    word          cluster
    ------------  ------------------------------------
    takeover      merge, buy-out, acquire, bid
    benefit       compensate, aid, expense
    capital       cash, fund, money
    staff         personnel, employee, force
    attract       lure, draw, woo
    sensitive     crucial, difficult, critical
    speculate     rumor, uncertainty, tension
    president     director, executive, chairman
    vice          deputy
    outlook       forecast, prospect, trend
    law           rule, policy, legislate, bill
    earnings      profit, revenue, income
    portfolio     asset, invest, loan
    inflate       growth, demand, earnings
    industry      business, company, market
    growth        increase, rise, gain
    firm          bank, concern, group, unit
    environ       climate, condition, situation
    debt          loan, secure, bond
    lawyer        attorney
    counsel       attorney, administrator, secretary
    compute       machine, software, equipment
    competitor    rival, competition, buyer
    alliance      partnership, venture, consortium
    big           large, major, huge, significant
    fight         battle, attack, war, challenge
    base          facile, source, reserve, support
    shareholder   creditor, customer, client,
                  investor, stockholder

Table 1. Selected clusters obtained from syntactic contexts,
derived from approx. 40 million words of WSJ text, with
weighted Tanimoto formula.


CREATING AN INDEX

Most  of  the  problems  we  encountered  when  adapt- 
ing NIST's  PRISE  system  for  use  in  TREC-2  had  to  do 
with  the  size  of  the  data  that  had  to  be  indexed. 

We had to deal with the restrictions imposed by the resources
we had (e.g., only 96 MBytes of virtual memory). The rest of
this paragraph describes some of the changes we made to the
NIST system in order to work within these restrictions. The
original system would request twice the previously requested
amount of memory each time it needed more. As a result, the
system would reach the limit of virtual memory after only a
relatively small portion of the total number of documents had
been indexed. In our version, the memory requested by the
system grows linearly. The increments are estimated in such a
way that the system never requests too much memory.
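
The effect of the two growth policies can be seen in a small
simulation. This is a simplified sketch and not the PRISE code:
the starting sizes and the 8 MByte increment are hypothetical,
and the model ignores the extra space a reallocation may need
while old and new buffers coexist. Only the 96 MByte limit
comes from the text.

```python
def largest_buffer(limit_mb, first_mb, grow):
    """Largest buffer (in MBytes) obtained before a request would exceed limit_mb.

    `grow` maps the current size to the size of the next request.
    """
    size = first_mb
    while grow(size) <= limit_mb:
        size = grow(size)
    return size

def doubling(size_mb):
    # Original policy: each request is twice the previous one.
    return 2 * size_mb

def linear(size_mb, step_mb=8):
    # Revised policy: fixed increments (the step size is an assumption).
    return size_mb + step_mb

usable_doubling = largest_buffer(96, 1, doubling)
usable_linear = largest_buffer(96, 8, linear)
```

Under these assumptions the doubling policy stalls at 64 MBytes,
because the next request (128 MBytes) exceeds the limit, whereas
linear growth can use all 96 MBytes.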

The indexing process became too fragile when the limits of the
environment were approached. When a large portion of the
virtual memory and of the disk space was being used by the
indexing process, crashes became very likely. Unfortunately, it
turned out that the process was very difficult to restart after
some crashes (e.g., in the rebuild phase), thus leading to
time-consuming repeats.

Indexing also takes too long at present. Given the size of the
data to be indexed, the whole process takes at least 250 hours
if everything goes well, which seldom happens. Given TREC-2's
deadlines we could not afford to perform too many experiments:
we barely had time to index the corpus once.

Most of the previous problems could be solved by distributing
the indexing process to several different machines and
performing the indexing in parallel.

We believe that it is possible to create several small indexes
instead of a single very large one. If certain rules are
followed when creating the distributed index, it should be
possible to merge the results of querying the set of small
indexes and to obtain a performance (recall and precision)
comparable to the results obtained using a single index. The
test setup we built in order to perform the experiments
required for TREC-2 should allow us to test these hypotheses.
The advantages of a distributed index are clear:

(1)  The indexing process would be faster.

(2)  Each one of the distributed indexing processes would be
     smaller and less fragile.

(3)  Even if one of the distributed processes crashes,
     restarting it would be less expensive.

(4)  A distributed system would be much easier to update,
     i.e., adding a new document would not require reindexing
     the whole corpus.

(5)  A distributed system would be more useful for studying
     the kinds of problems and solutions that are likely to be
     encountered in a real world situation.
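
One way the results of several small indexes might be combined
is sketched below. The text does not state the "rules" that
would make such a merge valid; the sketch simply assumes that
document scores are comparable across indexes (for instance
because every index uses the same collection-wide term
weights). The function name, the (score, doc_id) layout, and
the document identifiers are hypothetical.

```python
import heapq
from itertools import islice

def merge_ranked(result_lists, k):
    """Merge per-index result lists into one top-k ranking.

    Each list holds (score, doc_id) tuples sorted by descending score.
    Scores are assumed comparable across indexes.
    """
    merged = heapq.merge(*result_lists, key=lambda r: r[0], reverse=True)
    return list(islice(merged, k))

# Results of the same query against two small indexes (made-up values).
index_a = [(0.9, "A-0001"), (0.4, "A-0002")]
index_b = [(0.7, "B-0001"), (0.1, "B-0002")]

top3 = merge_ranked([index_a, index_b], 3)
```

Because each per-index list is already sorted, the merge only
needs to look at the head of each list, so the cost of
combining results grows with k rather than with the total size
of the indexes.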


SUMMARY  OF  RESULTS 

We processed a total of 850 MBytes of text during TREC-2. The
first 550 MBytes were articles from the Wall Street Journal
which were previously processed for TREC-1; we had to repeat
most of the processing to correct early tokenization errors
introduced by the tagger. The entire process (tagging, parsing,
phrase extraction) took just over 4 weeks on two Sun
SparcStations (1 and 2). Building a comprehensive index for the
WSJ database took another 2 weeks. This time we were able to
create a single index thanks to the improved indexing software
received from NIST. The final index size was 204 MBytes, and
included 2,274,775 unique terms (of which about 310,000 were
single-word terms, and the remaining 1,865,000 were syntactic
word pairs) occurring in 173,219 documents, or more than 13
(unique) terms per document. Note that this gives potentially
much better coverage of document content than single-word
terms alone, with less than 2 unique terms per document. We say
'potentially' since the process of deriving phrase-level terms
from text is still only partially understood, including the
complex problem of 'normalization' of representation.

The remaining 300 MBytes were articles from the San Jose
Mercury News, which were contained in TIPSTER disk-3.
Processing of this part, and creating an index for routing
purposes, took about 3 weeks. While natural language processing
required 2 weeks to complete (at approximately the same speed
as for the WSJ database), we were able to cut indexing time in
half by using a faster in-memory version of the NIST system.
This new version reduces the time required by the first phase
of indexing from days to hours; however, the second phase
remains slow (days) and fragile (we had to redo it 3 times).
The final size of the SJMN index was 101 MBytes, with
1,535,971 unique terms occurring in 86,270 documents (nearly 18
unique terms per document).¹³


¹³ It has to be noted that the ratios at which new terms are
generated are nearly identical in both databases; at 86,319
documents (or about half way through the WSJ database)
1,335,622 unique terms had been recorded.


Two types of retrieval have been done: (1) new topics 101-150
were run in the ad-hoc mode against the WSJ database, and (2)
topics 51-100, previously used in TREC-1, were run in the
routing mode against the SJMN database. In each category
several runs were attempted with a different combination of
fields from the topics used to create search queries. A typical
topic is shown below:

<top>
<head> Tipster Topic Description
<num> Number: 107
<dom> Domain: International Economics
<title> Topic: Japanese Regulation of Insider Trading

<desc> Description:
Document will inform on Japan's regulation of insider
trading.

<narr> Narrative:
A relevant document will provide data on Japanese laws,
regulations, and/or practices which help the foreigner
understand how Japan controls, or does not control,
stock market practices which could be labeled as insider
trading.

<con> Concept(s):
1. insider trading
2. Japan
3. Ministry of Finance, Securities and Exchange Council,
   Osaka Securities Exchange, Tokyo Stock Exchange
4. Securities and Exchange Law, Article 58, law,
   legislation, guidelines, self-regulation
5. Nikko Securities, Yamaichi Securities, Nomura Securities,
   Daiwa Securities, Big Four brokerage firms

<fac> Factor(s):
<nat> Nationality: Japan
</fac>
</top>

This topic actually consists of two different statements of the
same query: the natural language specification consisting of
<desc> and <narr> fields, and an expert-selected list of key
terms which are often far more informative than the narrative
part (in some cases these terms were selected via feedback in
actual retrieval attempts). The table below shows the search
query obtained from fields <desc> and <narr> of topic 107,
after expansion with similar terms and deletion of low-content
terms.


Query 107

    term                 idf      weight
    -----------------    -----    ------
    standard+trade       16.81    0.38
    standard+trade       16.81    0.38
    regulate+japanese    15.40    1.00
    standard+japanese    14.08    0.38
    regulate+trade       12.84    1.00
    regulate+trade       12.84    1.00
    controls              9.97    1.00
    labele                9.20    1.00
    trade+inside          8.62    1.00
    trade+inside          8.62    1.00
    inside                6.49    1.00
    inside                6.49    1.00
    inside                6.49    1.00
    regulate              5.66    1.00
    regulate              5.66    1.00
    regulate              5.66    1.00
    practice              5.46    1.00
    practice              5.46    1.00
    data                  4.91    1.00
    data                  4.91    0.51
    data                  4.91    0.26
    Japanese              4.84    1.00
    Japanese              4.84    1.00
    standard              4.81    0.38
    standard              4.81    0.38
    standard              4.81    0.38
    inform                4.71    1.00
    inform                4.71    0.26
    inform                4.71    0.51
    protect               4.69    0.41

Note that many 'function' words have been removed from the
query, e.g., provide, understand, as well as other 'common
words' such as document and relevant (this is in addition to
our regular list of 'stopwords'). Some still remain, however,
e.g., data and inform, because these could not be uniformly
considered as 'common' across all queries.
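
The pruning step can be pictured with the sketch below. It is
an assumption-laden illustration rather than the system's
actual procedure: the function name, the word lists, the idf
cutoff of 4.0, and the idf values given for provide and
document are all hypothetical. Only the entries for
regulate+trade, data, and protect are copied from the query
shown above.

```python
def prune_query(terms, stopwords, common_words, min_idf):
    """Drop stopwords, query-independent 'common words', and low-idf terms.

    terms is a list of (term, idf, weight) tuples; the survivors are
    returned in their original order.
    """
    kept = []
    for term, idf, weight in terms:
        if term in stopwords or term in common_words:
            continue
        if idf < min_idf:
            continue
        kept.append((term, idf, weight))
    return kept

# Word lists and cutoff are illustrative, not the ones used in TREC-2.
STOPWORDS = {"the", "of", "which"}
COMMON_WORDS = {"document", "relevant"}

raw_query = [
    ("regulate+trade", 12.84, 1.00),
    ("data", 4.91, 1.00),
    ("protect", 4.69, 0.41),
    ("provide", 3.20, 1.00),    # hypothetical idf
    ("document", 2.10, 1.00),   # hypothetical idf
]

pruned = prune_query(raw_query, STOPWORDS, COMMON_WORDS, min_idf=4.0)
```

A term such as data survives here because it is neither on a
fixed list nor below the cutoff, which mirrors the observation
that some low-content words cannot be treated as common across
all queries.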

Results obtained for queries using text fields only and those
based primarily on keyword fields are reported separately. The
purpose of this distinction was to demonstrate that (or
whether) intensive natural language processing can make an
imprecise and frequently convoluted narrative into a better
query than an expert would create.¹⁴

The ad-hoc category runs were done as follows (these are the
official TREC-2 results):

(1)  nyuir1: An automatic run of topics 101-150 against the
     WSJ database with the following fields used: <title>,
     <desc>, and <narr> only. Both syntactic phrases and term
     similarities were included.

(2)  nyuir2: An automatic run of topics 101-150 against the
     WSJ database with the following fields used: <title>,
     <desc>, <con> and <fac> only. Both syntactic phrases and
     term similarities were included.

(3)  nyuir3: A run of manually pruned topics 101-150 against
     the WSJ database with the following fields used: <title>,
     <desc>, <con> and <fac> only. Both syntactic phrases and
     term similarities were included. Manual intervention
     involved removing some terms from queries before database
     search.


¹⁴ Some results on the impact of different fields in TREC
topics on the final recall/precision results were reported by
Broglio and Croft (1993) at the ARPA HLT workshop, although
text-only runs were not included. One of the most striking
observations they have made is that the narrative field is
entirely disposable, and moreover that its inclusion in the
query actually hurts the system's performance. Croft (personal
communication, 1992) has suggested that excluding all
expert-made fields (i.e., <con> and <fac>) would make the
queries quite ineffective. Broglio (personal communication,
1993) confirms this, showing that text-only retrieval (i.e.,
with <desc> and <narr>) shows an average precision more than
30% below that of <con>-based retrieval.


Summary statistics for these runs are shown in Table 2. In
addition, the 'base' column reports the system's performance on
text fields with no language preprocessing, and no phrase terms
or similarities used. We must note, however, that in all cases
the topics have been processed with our suffix-trimmer, which
means some NLP has been done already (tagging + lexicon), and
therefore what we see here is not the performance of a 'pure'
statistical system.

In the routing category only automatic runs were done (again,
these are the official TREC-2 results):

(1)  nyuir1: An automatic run of topics 51-100 against the
     SJMN database with the following fields used: <title>,
     <desc>, and <narr> only. Both syntactic phrases and term
     similarities were included.

(2)  nyuir2: An automatic run of topics 51-100 against the
     SJMN database with the following fields used: <title>,
     <desc>, <con> and <fac> only. Both syntactic phrases and
     term similarities were included.

A (simulated) routing mode run means that queries (i.e., terms
and their weights) were derived with respect to a (different)
training database (here WSJ), and were subsequently run against
the new database (here SJMN). In particular, this means that
the terms and their relative importance (reflected primarily
through idf weights) were those of the WSJ database rather than
the SJMN database.

Routing runs are summarized in Table 3. Again a column 'base'
is added to show the system's performance without the NLP
module. We may note that the routing results are generally well
below the ad-hoc results, both because the base system
performance is inferior and because query processing has a
different effect on the final statistics. The last column is a
post-TREC run.¹⁵


¹⁵ It should be noted that in category B runs, three topics
(63, 65, and 88) had no relevant documents in the SJMN
database. Unfortunately, the evaluation program counts those as
if there were relevant documents but none had been found, thus
underestimating the system's performance by 5 to 8%. Excluding
these three topics from consideration we obtain, in the last
column, the average precision of 0.2624 and the R-precision of
0.3000.
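
The adjustment described in this note can be checked with a
short calculation. The sketch assumes that the reported
averages are plain means over the 50 topics and that the three
topics without relevant documents each contributed zero; the
function name is hypothetical. The input figures are the
last-column (nyuir2a) values from Table 3.

```python
def mean_excluding_zero_topics(mean_over_all, n_topics, n_zero_topics):
    """Recompute a per-topic mean after dropping topics that scored zero."""
    return mean_over_all * n_topics / (n_topics - n_zero_topics)

# Table 3, last column: average precision 0.2466, R-precision 0.2820.
adjusted_avg_precision = mean_excluding_zero_topics(0.2466, 50, 3)
adjusted_r_precision = mean_excluding_zero_topics(0.2820, 50, 3)
```

Both results agree with the figures in the note (0.2624 and
0.3000) to within the rounding of the four-digit table entries,
and the scaling factor 50/47 corresponds to an increase of
about 6%, inside the stated 5 to 8% range.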


    Run           base      nyuir1    nyuir2    nyuir3
    Name          ad-hoc    ad-hoc    ad-hoc    ad-hoc
    Queries       50        50        50        50

    Tot number of docs over all queries
    Ret           49887     49884     49876     49877
    Rel           3929      3929      3929      3929
    RelRet        2740      2983      3274      3281

    Recall        (interp) Precision Averages
    0.00          0.7038    0.7013    0.7528    0.7528
    0.10          0.4531    0.4874    0.5567    0.5574
    0.20          0.3708    0.4326    0.4721    0.4724
    0.30          0.3028    0.3531    0.4060    0.4076
    0.40          0.2550    0.3076    0.3617    0.3621
    0.50          0.2059    0.2637    0.3135    0.3142
    0.60          0.1641    0.2175    0.2703    0.2711
    0.70          0.1180    0.1617    0.2231    0.2237
    0.80          0.0766    0.1176    0.1667    0.1697
    0.90          0.0417    0.0684    0.0915    0.0916
    1.00          0.0085    0.0102    0.0154    0.0160

    Average precision over all rel docs
    Avg           0.2224    0.2649    0.3111    0.3118

    Precision at
    5 docs        0.4640    0.4920    0.5360    0.5360
    10 docs       0.4140    0.4420    0.4880    0.4880
    15 docs       0.3867    0.4240    0.4693    0.4707
    20 docs       0.3670    0.4050    0.4390    0.4410
    30 docs       0.3253    0.3640    0.4067    0.4080
    100 docs      0.2304    0.2720    0.3094    0.3094
    200 docs      0.1626    0.1886    0.2139    0.2140
    500 docs      0.0911    0.1026    0.1137    0.1140
    1000 docs     0.0548    0.0597    0.0655    0.0656

    R-Precision (after RelRet)
    Exact         0.2605    0.3003    0.3320    0.3321

Table 2. Automatic ad-hoc run statistics for queries 101-150
against WSJ database: (1) base - statistical terms only with
<desc> and <narr> fields; (2) nyuir1 - using syntactic phrases
and similarities with <desc> and <narr> fields only; (3) nyuir2
- same as 2 but with <desc>, <con>, and <fac> fields only; and
(4) nyuir3 - same as 3 but queries manually pruned before
search.


    Run           base      nyuir1    nyuir2    nyuir2a
    Name          routing   routing   routing   routing
    Queries       50        50        50        50

    Tot number of docs over all queries
    Ret           50000     50000     50000     50000
    Rel           2064      2064      2064      2064
    RelRet        1349      1390      1610      1623

    Recall        (interp) Precision Averages
    0.00          0.5276    0.5400    0.6435    0.6458
    0.10          0.3685    0.3937    0.4610    0.5021
    0.20          0.3054    0.3423    0.3705    0.4151
    0.30          0.2373    0.2572    0.3031    0.3185
    0.40          0.2039    0.2263    0.2637    0.2720
    0.50          0.1824    0.2032    0.2282    0.2379
    0.60          0.1596    0.1674    0.1934    0.1899
    0.70          0.1167    0.1295    0.1542    0.1571
    0.80          0.0854    0.0905    0.1002    0.1163
    0.90          0.0368    0.0442    0.0456    0.0434
    1.00          0.0228    0.0284    0.0186    0.0158

    Average precision over all rel docs
    Avg           0.1884    0.2038    0.2337    0.2466

    Precision at
    5 docs        0.3160    0.3360    0.4280    0.4440
    10 docs       0.3100    0.3240    0.4000    0.4180
    15 docs       0.2813    0.2933    0.3613    0.3800
    20 docs       0.2670    0.2790    0.3260    0.3530
    30 docs       0.2240    0.2404    0.2760    0.2993
    100 docs      0.1306    0.1412    0.1708    0.1698
    200 docs      0.0865    0.0939    0.1078    0.1107
    500 docs      0.0464    0.0489    0.0575    0.0570
    1000 docs     0.0270    0.0278    0.0322    0.0325

    R-Precision (after Rel)
    Exact         0.2196    0.2267    0.2513    0.2820

Table 3. Automatic routing run statistics for queries 51-100
against SJMN database: (1) base - statistical terms only with
<desc> and <narr> fields; (2) nyuir1 - using syntactic phrases
and similarities with <desc> and <narr> fields only; (3) nyuir2
- same as 2 but with <desc>, <con>, and <fac> fields only; and
(4) nyuir2a - run nyuir2 repeated with new weighting for
phrases.

TERM  WEIGHTING  ISSUES 

Finding a proper term weighting scheme is critical
in term-based retrieval since the rank of a document is
determined by the weights of the terms it shares with the
query. One popular term weighting scheme, known as
tf.idf, weights terms proportionately to their inverted
document frequency scores and to their in-document
frequencies (tf). The in-document frequency factor is
usually normalized by the document length, that is, it is
more significant for a term to occur 5 times in a short
20-word document, than to occur 10 times in a 1000-word
article.
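
The normalization argument can be checked with a small calculation. The sketch below uses one common textbook form of length-normalized tf.idf; the paper does not state its exact formula, so the function name, the log-based idf, and the collection statistics (`df`, `n_docs`) are assumptions chosen only for illustration.

```python
import math

def tfidf_weight(tf, doc_len, df, n_docs):
    """Length-normalized tf.idf (assumed form): (tf / doc_len) * log(n_docs / df)."""
    idf = math.log(n_docs / df)
    return (tf / doc_len) * idf

# The example from the text: the same term occurs 5 times in a
# 20-word document and 10 times in a 1000-word article.
# df and n_docs are made-up collection statistics.
short_doc = tfidf_weight(tf=5, doc_len=20, df=100, n_docs=100_000)
long_doc = tfidf_weight(tf=10, doc_len=1000, df=100, n_docs=100_000)

# The short document gets the larger weight despite the lower raw count.
print(short_doc > long_doc)
```

Because the idf factor is identical for both documents, the ratio of the two weights depends only on the normalized frequencies, 5/20 against 10/1000.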

In our official TREC runs we used the normalized
tf.idf weights for all terms alike: single 'ordinary-word'
terms, proper names, as well as phrasal terms consisting
of 2 or more words. Whenever phrases were included in
the term set of a document, the length of this document
was increased accordingly. This had the effect of
decreasing tf factors for 'regular' single word terms.

A standard tf.idf weighting scheme (and we
suspect any other uniform scheme based on frequencies)
is inappropriate for mixed term sets (ordinary concepts,
proper names, phrases) because:

(1) It favors terms that occur fairly frequently in a
document, which supports only general-type
queries (e.g., "all you know about 'star wars'").
Such queries are not typical in TREC.

(2) It attaches low weights to infrequent, highly
specific terms, such as names and phrases, whose
only occurrences in a document often decide
relevance. Note that such terms cannot be
reliably distinguished using their distribution in the
database as the sole factor, and therefore syntactic
and lexical information is required.

(3) It does not address the problem of inter-term
dependencies arising when phrasal terms and
their component single-word terms are all
included in a document representation, i.e.,
launch+satellite and satellite are not independent,
and it is unclear whether they should be
counted as two terms.

In our post-TREC-2 experiments we considered
(1) and (2) only. We changed the weighting scheme so
that the phrases (but not the names which we did not
distinguish in TREC-2) were more heavily weighted by
their idf scores while the in-document frequency scores
were replaced by logarithms multiplied by sufficiently
large constants. In addition, the top N highest-idf
matching terms (simple or compound) were counted more
toward the document score than the remaining terms.
This 'hot-spot' retrieval option is discussed in the next
section.
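
A rough picture of the revised scheme is sketched below. The paper gives neither the formula nor the constants, so everything here is hypothetical: the function name, the way the two factors are combined, the constant `c`, and the `phrase_boost` factor standing in for the heavier idf weighting of phrases.

```python
import math

def revised_weight(tf, idf, is_phrase, c=1.0, phrase_boost=2.0):
    """Hypothetical idf-dominated weight with a log-damped tf factor."""
    if tf <= 0:
        return 0.0
    # The raw in-document frequency is replaced by a scaled logarithm.
    damped_tf = c * (1.0 + math.log(tf))
    # Phrases lean more heavily on their idf score (assumed multiplier).
    boosted_idf = idf * (phrase_boost if is_phrase else 1.0)
    return damped_tf * boosted_idf

# A single occurrence of a phrase outweighs a single occurrence of a
# single-word term with the same idf.
print(revised_weight(1, 5.0, is_phrase=True), revised_weight(1, 5.0, is_phrase=False))
```

The point of the damping is that a term occurring 100 times no longer counts 100 times as much as a term occurring once, so rare but specific terms are not drowned out.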


* This is not always true, for example when all occurrences of a
term are concentrated in a single section or a paragraph rather than
spread around the article. See the following section for more discussion.

The table below illustrates the problem of weighting
phrasal terms using topic 101 and a relevant document
(WSJ870226-0091).

Topic 101 matches WSJ870226-0091
(duplicate terms not shown)

TERM                    TF.IDF    NEW WEIGHT
sdi                       1750          1750
eris                      3175          3175
star                      1072          1072
wars                      1670          1670
laser                     1456          1456
weapon                    1639          1639
missile                    872           872
space+base                2641          2105
interceptor               2075          2075
exoatmospheric            1879          3480
system+defense            2846          2219
reentry+vehicle           1879          3480
initiative+defense        1646          2032
system+interceptor        2526          3118

DOC RANK                    30            10

Changing the weighting scheme for compound terms,
along with other minor improvements (such as expanding
the stopword list for topics, or correcting a few parsing
bugs) has led to an overall increase in precision of
nearly 20% over our official TREC-2 ad-hoc results.
Table 4 summarizes these new runs for queries 101-150
against WSJ database. Similar improvements have been
obtained for queries 51-100.

The results of the routing runs against SJMN database
are somewhat more troubling. Applying the new
weighting scheme we did see the average precision
increase by some 5 to 12% (see column 4 in Table 3), but
the results remain far below those for the ad-hoc runs.
Direct runs of queries 51-100 against SJMN database
produce results that are about the same as in the routing
runs (which may indicate that our routing scheme works
fine), however the same queries run against WSJ database
have retrieval precision some 25% above SJMN
runs. This may indicate some problems with SJMN database
or the relevance judgements for it.

'HOT SPOT' RETRIEVAL

Another difficulty with frequency-based term
weighting arises when a long document needs to be
retrieved on the basis of a few short relevant passages. If
the bulk of the document is not directly relevant to the
query, then there is a strong possibility that the document
will score low in the final ranking, despite some strongly
relevant material in it. This problem can be dealt with by
subdividing long documents at paragraph breaks, or into
approximately equal length fragments and indexing the
database with respect to these (e.g., Kwok 1993). While
such approaches are effective, they also tend to be costly
because of increased index size and more complicated
access methods.

Run                                  nyuir1  nyuir1a   nyuir2  nyuir2a
Name                                 ad-hoc   ad-hoc   ad-hoc   ad-hoc
Queries                                  50       50       50       50

Tot number of docs over all queries
  Ret                                 49884    50000    49876    50000
  Rel                                  3929     3929     3929     3929
  RelRet                               2983     3108     3274     3401

Recall - (interp) Precision Averages
  0.00                               0.7013   0.7201   0.7528   0.8063
  0.10                               0.4874   0.5239   0.5567   0.6198
  0.20                               0.4326   0.4751   0.4721   0.5566
  0.30                               0.3531   0.4122   0.4060   0.4786
  0.40                               0.3076   0.3541   0.3617   0.4257
  0.50                               0.2637   0.3126   0.3135   0.3828
  0.60                               0.2175   0.2752   0.2703   0.3380
  0.70                               0.1617   0.2142   0.2231   0.2817
  0.80                               0.1176   0.1605   0.1667   0.2164
  0.90                               0.0684   0.1014   0.0915   0.1471
  1.00                               0.0102   0.0194   0.0154   0.0474

Average precision over all rel docs
  Avg                                0.2649   0.3070   0.3111   0.3759

Precision at
  5 docs                             0.4920   0.5200   0.5360   0.6040
  10 docs                            0.4420   0.4900   0.4880   0.5580
  15 docs                            0.4240   0.4653   0.4693   0.5253
  20 docs                            0.4050   0.4420   0.4390   0.4980
  30 docs                            0.3640   0.3993   0.4067   0.4607
  100 docs                           0.2720   0.2914   0.3094   0.3346
  200 docs                           0.1886   0.2064   0.2139   0.2325
  500 docs                           0.1026   0.1103   0.1137   0.1229
  1000 docs                          0.0597   0.0622   0.0655   0.0680

R-Precision (after Rel)
  Exact                              0.3003   0.3332   0.3320   0.3950

Table 4. Automatic ad-hoc run statistics for queries 101-150 against
WSJ database: (1) nyuir1 - TREC-2 official run with <desc> and <narr>
fields only; (2) nyuir1a - revised term weighting run; (3) nyuir2 -
official TREC-2 run with <desc>, <con>, and <fac> fields only; and (4)
nyuir2a - revised weighting run.

Efficiency considerations have led us to investigate
an alternative approach to hot spot retrieval which
would not require re-indexing of the existing database or
any changes in document access. In our approach, the
maximum number of terms on which a query is permitted
to match a document is limited to the N highest weight
terms, where N can be the same for all queries or may
vary from one query to another. Note that this is not the
same as simply taking the N top terms from each query.
Rather, for each document for which there are M matching
terms with the query, only min(M,N) of them,
namely those which have highest weights, will be
considered when computing the document score. Moreover,
only the global importance weights for terms are
considered (such as idf), while local in-document frequency
(e.g., tf) is suppressed by either taking a log or replacing
it with a constant. The effect of this 'hot spot' retrieval is
shown below in the ranking of relevant documents within
the top 1000 retrieved documents for topic 65:


Full tf.idf retrieval

DOCUMENT ID        RANK    SCORE
WSJ870304-0091        4    12228
WSJ891017-0156        7     9771
WSJ920226-0034       14     8921
WSJ870429-0078       26     7570
WSJ870205-0078       33     6972
WSJ880712-0033       34     6834
WSJ920116-0002       37     6580
WSJ910328-0013       74     4872
WSJ910830-0140       80     4701
WSJ890804-0138      102     4134
WSJ911212-0022      104     4065
WSJ870825-0026      113     3922
WSJ880712-0023      135     3654
WSJ871202-0145      153     3519

Hot-spot idf-dominated with N=20

DOCUMENT ID        RANK    SCORE
WSJ920226-0034        1    11955
WSJ870304-0091        3    11565
WSJ870429-0078        5     9997
WSJ920116-0002        7     9997
WSJ910830-0140       11     8792
WSJ870205-0078       20     8402
WSJ910328-0013       29     8402
WSJ880712-0033       71     6834
WSJ880712-0023       72     6834
WSJ891017-0156       87     6834
WSJ890804-0138       92     6834
WSJ911212-0022      111     6834
WSJ871202-0145      124     6834

The final ranking is obtained by merging the two
rankings by score. While some of the recall may be
sacrificed ('hot spot' retrieval has, understandably, lower
recall than full query retrieval, and this becomes the
lower bound on recall for the combined ranking), the
precision of the combined ranking has been consistently
better than that of either original ranking: the average
improvement is 10-12% above the tf.idf run precision
(which is often the stronger of the two).
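
The hot-spot scoring rule, and one possible reading of the merge step, can be sketched as follows. This is a minimal illustration, not the system's code: it takes the "replace tf with a constant" option, so each matching term contributes only its idf, and it assumes that "merging by score" means keeping each document's higher score from the two runs. The function names and the toy idf values are hypothetical.

```python
def hot_spot_score(query_terms, doc_terms, idf, n=20):
    """Sum the idf of the min(M, N) highest-idf terms shared by query and document."""
    matching = [idf.get(t, 0.0) for t in set(query_terms) if t in doc_terms]
    matching.sort(reverse=True)
    return sum(matching[:n])  # a slice never exceeds the M matches available

def merge_by_score(run_a, run_b):
    """Assumed merge: keep each document's best score, then sort descending."""
    merged = dict(run_a)
    for doc_id, score in run_b.items():
        merged[doc_id] = max(score, merged.get(doc_id, score))
    return sorted(merged.items(), key=lambda item: (-item[1], item[0]))

# Toy idf table; "d" is in the query but not in the document.
idf = {"a": 3.0, "b": 2.0, "c": 1.0, "d": 9.0}
print(hot_spot_score(["a", "b", "c", "d"], {"a", "b", "c"}, idf, n=2))
print(merge_by_score({"d1": 5, "d2": 1}, {"d2": 4, "d3": 3}))
```

Note that the cutoff is applied per document after matching, which is what distinguishes this from simply truncating the query to its N best terms.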


CONCLUSIONS 

We presented in some detail our natural language
information retrieval system consisting of an advanced
NLP module and a 'pure' statistical core engine. While
many problems remain to be resolved, including the
question of adequacy of term-based representation of
document content, we attempted to demonstrate that the
architecture described here is nonetheless viable. In
particular, we demonstrated that natural language processing
can now be done on a fairly large scale and that its speed
and robustness can match those of traditional statistical
programs such as key-word indexing or statistical phrase
extraction. We suggest, with some caution until more
experiments are run, that natural language processing can
be very effective in creating appropriate search queries
out of a user's initial specifications, which can be
frequently imprecise or vague.

On  the  other  hand,  we  must  be  aware  of  the  limits 
of  NLP  technologies  at  our  disposal.  While  part-of- 
speech  tagging,  lexicon-based  stemming,  and  parsing  can 
be  done  on  large  amounts  of  text  (hundreds  of  millions  of 
words  and  more),  other,  more  advanced  processing 
involving  conceptual  structuring,  logical  forms,  etc.,  is 
still  beyond  reach,  computationally.  It  may  be  assumed 
that  these  super-advanced  techniques  will  prove  even 
more  effective,  since  they  address  the  problem  of 
representation-level  limits;  however  the  experimental 
evidence  is  sparse  and  necessarily  limited  to  rather  small 
scale tests (e.g., Mauldin, 1991).

1 Introduction

The CLARIT team used the opportunity of the TREC-2
evaluations to explore several facets of the CLARIT
system. In particular, given the performance of the
CLARIT system on TREC-1 tasks (Evans et al. 1993), we
focused our attention on evaluating

1. fully-automatic processing of topics and
potentially-relevant documents and

2. topic/query augmentation using CLARIT
thesaurus-discovery techniques.

All of the results we report in this paper follow from
straightforward applications of base-level CLARIT
processing, utilizing essentially the same CLARIT
components that were employed in the CLARIT-TREC-1
system. The general improvements we observe in
CLARIT-TREC-2 processing are attributable to
modifications (especially simplifications) in processing steps
and in the settings of system variables.

In the following sections, we describe the CLARIT-TREC-2
system, report our official processing results,
and offer a brief analysis of performance. In addition,
we report on several subsequent experiments we have
conducted on the TREC-2 collection that test the
parameters of the CLARIT-TREC-2 system and identify
sources of immediate improvements in processing.

2  CLARIT-TREC-2  System  Description 
and  Processing  Method 

The CLARIT-TREC-2 system reflects a re-organization
of the tools and techniques employed in the CLARIT-TREC-1
system. One of our principal goals was to
streamline CLARIT processing and to establish a baseline
method that is amenable to parameterization and
analysis. As a consequence, the flow of data in the
CLARIT-TREC-2 system is simple, straightforward,
and efficient; furthermore, all CLARIT processing is
fully automatic.


2.1    Changes  from  TREC-1 

The essential differences between the CLARIT-TREC-1
and TREC-2 systems are in the preparation and evaluation
of queries (TREC-2 "topics") and the automation of
steps designed to identify and process potentially
relevant documents for use in query augmentation. The
following summaries highlight these points.

• One-Pass Querying. The CLARIT-TREC-1 system
employed a two-step process to retrieve
documents: a first pass for partitioning ("evoking")
and a second pass for final ranking
("discrimination"). This has been eliminated in the
CLARIT-TREC-2 system. Querying takes place
in one step over the entire collection using
vector-space-retrieval methods.

• Automatic Query Creation. The CLARIT-TREC-1
system was categorized as a "manual" system,
though the required manual intervention was
minimal. In particular, users were expected to assign
an importance coefficient (with possible values "1",
"2", or "3") to the CLARIT-parsed terms in a topic
statement and possibly also to add terms to or
delete terms from the CLARIT-generated list. In
the CLARIT-TREC-2 system, the importance
coefficient is assigned automatically by simple
heuristics (described below). While users are still free
to modify coefficients or terms, such intervention
is not required. All "CLARTA" results reported
in this paper reflect processing in which queries
were fully automatically prepared by the CLARIT
system, without review or modification.

• Automatic Retrieval Refinement. When processing
ad-hoc queries, the CLARIT-TREC-1 system
required that the user evaluate a few of the
top-ranked retrieved documents. User-nominated
documents were processed to identify terms for
use in supplementing the source query. In the
CLARIT-TREC-2 system, user evaluations are not
required. Initial querying is accurate enough to
support the automatic processing of the
highest-scoring retrieved documents without 'inspection'.

• Sub-Document Processing. The CLARIT-TREC-1
system treated all documents as whole texts;
retrieval 'scores' were calculated over full
documents. The CLARIT-TREC-2 system treats all
documents as collections of one or more
sub-documents, operationalized as variable-sized units
of approximately paragraph length. Such units are
used as the basis for all statistical calculations and
for measuring 'similarity' to a query. A full
document is assigned the score (e.g., for ranking) of the
highest-scoring sub-document it contains.
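
The sub-document rule in the last item, where a document inherits the score of its best sub-document, can be sketched in a few lines. The similarity function below is a deliberately crude term-overlap count, not the CLARIT vector-space measure; it and the function names are hypothetical stand-ins.

```python
def overlap_similarity(query_terms, passage):
    """Toy similarity: number of query terms present in the passage."""
    words = set(passage.lower().split())
    return sum(1 for term in query_terms if term in words)

def document_score(query_terms, sub_documents):
    """A document takes the score of its highest-scoring sub-document."""
    return max(
        (overlap_similarity(query_terms, sub) for sub in sub_documents),
        default=0,
    )

subs = ["laser weapon test", "weather report today", "laser only"]
print(document_score({"laser", "weapon"}, subs))
```

With this rule a long document is not penalized for its irrelevant paragraphs: only the best passage determines where it ranks.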

2.2   Processing  Method 

Figure 1 offers a schematic overview of processing in
the CLARIT-TREC-2 system. All topics were parsed
for noun phrases. These, in turn, were either manually
("CLARTM") or automatically ("CLARTA") assigned
weights (values "1", "2", or "3") for 'importance'. The
terms for each topic were automatically supplemented
with terms from a (pseudo-)thesaurus, automatically
extracted from available known-relevant documents
(in the case of routing topics) or from the top-ranked
sub-documents returned in a first-pass querying of the
TREC-2 collection (in the case of ad-hoc topics). All
instances of retrieval took place over the applicable full set
of documents, which had undergone an initial round
of CLARIT processing (parsing).


The CLARIT-TREC-2 system incorporates a vector-space
retrieval system that uses several CLARIT-specific
techniques to improve retrieval results. The
principal techniques involve the use of (1)
natural-language processing to identify and normalize
indexing terms, (2) fully automatic query augmentation
based on CLARIT thesaurus discovery, and (3) simple
text-analysis heuristics to approximate the effect of
more sophisticated discourse analysis of texts. These
techniques are described in greater detail in the
following sections.

2.2.1   Natural-Language  Processing 

CLARIT natural-language processing (NLP) encompasses
an inflectional morphological analyzer for word
recognition and normalization and a deterministic
rule-based parser for phrase identification. For TREC-2
processing, only simplex noun phrases (NPs) were used.
Simplex NPs are phrasal constituents that include the
modifiers and head noun(s) of an NP but not the
post-head prepositional phrases, relative clauses, or verb
constructions. The CLARIT parser can provide a more
complex linguistic analysis of texts, but such additional
detail was not used in TREC-2 experiments.


Figure 1: Overview of CLARIT-TREC-2 Processing

2.2.2  Query  Augmentation

CLARIT thesaurus discovery is a process that can identify
a set of core, representative terminology in a
collection of documents. If the documents are relatively
homogeneous topically, the resulting 'first-order'
thesaurus can be regarded as a broad 'signature' of the
topic of the collection. Typically, the procedure is most
reliable when the collection is large (2 megabytes or
more of text). In the CLARIT-TREC-2 system, however,
thesaurus discovery was used on the relatively
small collections of relevant documents for each topic.
For each routing topic, the document collection used to
establish a first-order thesaurus consisted of the training
set of relevant documents for that topic; for each
ad-hoc topic, it consisted of a set of the top-ranked
sub-documents from among the documents automatically
retrieved via an initial pass of querying over the corpus.
Terms in the discovered thesaurus were used to supplement
the terms in the source query (topic). Thus, in the case of
ad-hoc topics, the procedure represents an approach to
query augmentation based on fully automatic feedback.
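
The overall shape of this feedback loop can be illustrated with a small sketch. CLARIT thesaurus discovery itself is not specified in this paper, so the code swaps in a much simpler stand-in: it counts how many feedback sub-documents contain each term and adds the most widespread new terms to the query. The function name, the parameter `k`, and the selection rule are all assumptions.

```python
from collections import Counter

def augment_query(query_terms, feedback_sub_documents, k=3):
    """Add the k terms found in the most feedback sub-documents (stand-in rule)."""
    counts = Counter()
    for terms in feedback_sub_documents:
        counts.update(set(terms))  # count each term once per sub-document
    candidates = [(t, c) for t, c in counts.items() if t not in query_terms]
    # Most widespread first; ties broken alphabetically for determinism.
    candidates.sort(key=lambda tc: (-tc[1], tc[0]))
    return list(query_terms) + [t for t, _ in candidates[:k]]

# For routing topics the feedback set would be known-relevant training
# documents; for ad-hoc topics, top-ranked first-pass sub-documents.
feedback = [["sdi", "laser", "missile"], ["laser", "missile"], ["laser", "budget"]]
print(augment_query(["sdi"], feedback, k=2))
```

Only the source of the feedback set differs between the routing and ad-hoc cases; the augmentation step itself is the same.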

2.2.3  Text  Analysis  Heuristics 

Texts can have complex and interesting discourse
structure, which often provides clues about the 'topic(s)' and
the 'important' information in a document. In general,
it is very difficult to exploit text-structure information
reliably in retrieval tasks over large collections of
heterogeneous documents. Nevertheless, the CLARIT-TREC-2
system applied two simple processing
techniques to topics and corpus documents to attempt to
capture information encoded in rhetorical structure.

First, all training and corpus documents were
divided into paragraph-sized units called
"sub-documents". This procedure is sensitive both to the
'normal' demarcation of paragraphs (successive blank
lines) and also to the total length of the text (measured in
numbers of sentences). After documents are partitioned
into sub-documents, the sub-document is taken as the
basic unit for subsequent processing, viz., in collecting
statistics and in scoring retrieval.
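
A partitioning step of this kind might look like the sketch below. The paper says only that the procedure respects blank-line paragraph breaks and text length in sentences; the specific rule used here (split at blank lines, then cut any paragraph longer than `max_sentences` into chunks) and the naive sentence splitter are assumptions.

```python
import re

def partition(text, max_sentences=5):
    """Split text into sub-documents at blank lines, capping sentence count."""
    units = []
    for paragraph in re.split(r"\n\s*\n", text):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        # Naive sentence boundary: whitespace after ., ! or ?
        sentences = re.split(r"(?<=[.!?])\s+", paragraph)
        for start in range(0, len(sentences), max_sentences):
            units.append(" ".join(sentences[start:start + max_sentences]))
    return units

sample = "A one. B two. C three.\n\nD four."
print(partition(sample, max_sentences=2))
```

The cap keeps unusually long paragraphs from behaving like whole documents, so the units stay roughly comparable in size.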

Second,  terms  extracted  from  a  topic  description 
were  assigned  importance  coefficients  based  on  their 
locations  in  the  topic  text.  Terms  found  in  the  first 
paragraph  are  given  a  weight  of  3;  terms  in  the  second 
paragraph,  2;  and  all  other  terms,  1. 
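
The position heuristic is simple enough to state directly in code. In the sketch below each topic paragraph is given as a list of already-extracted terms; how a term that appears in more than one paragraph is handled is not stated in the paper, so keeping the weight of its first (highest-weighted) occurrence is an assumption, as is the function name.

```python
def importance_coefficients(topic_paragraphs):
    """Map each term to 3, 2, or 1 by the paragraph in which it first appears."""
    coefficients = {}
    for index, terms in enumerate(topic_paragraphs):
        weight = 3 if index == 0 else 2 if index == 1 else 1
        for term in terms:
            # Assumption: a repeated term keeps its earliest, highest weight.
            coefficients.setdefault(term, weight)
    return coefficients

topic = [["star wars"], ["laser weapon", "star wars"], ["budget"], ["treaty"]]
print(importance_coefficients(topic))
```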

Of course, both techniques exploit possibly
idiosyncratic characteristics of the TREC-2 processing task.
The use of scoring over sub-documents is clearly
sensitive to the TREC definition of document relevancy,
viz., that a document is relevant regardless of length if
it contains a single relevant sentence. Furthermore,
automatic term importance weighting is possible only
because of the formal discourse structure of TREC topics;
it would not necessarily apply to other presentations
of topics or queries and certainly would not apply to
free text in general.

We observe, however, that there is no one set
of techniques that will perform optimally in every
information-retrieval (IR) situation. We emphasize,
instead, that one important measure of a system's utility
and adaptability is its ability to take advantage of
exploitable features in a given IR task.

2.3    System  Performance  Notes 

All CLARIT-TREC-2 processing took place on DEC
3000/400 (ALPHA/AXP) workstations running DEC
OSF/1. One system had 128 megabytes of RAM, one
64 megabytes, and two 32 megabytes. Realized
performance was considerably slower than would be
expected given the clockrate (133.33 MHz) of the DEC
3000/400's CPU. In fact, because of suboptimal
compilers for the 64-bit architecture, performance was
perhaps two times slower than the maximum possible.
One pass over the entire TREC-2 collection for 50
topics (processed simultaneously) required approximately
four hours, running on the four machines in parallel.
The processing of a single topic is proportionally faster,
requiring approximately 20 minutes on a single
machine.

3  Results 

The CLARIT team processed all the topics in both the
"routing" and "ad-hoc" categories and worked with
the full set of data. Two sets of results were
submitted for each category, corresponding to the "manual"
("CLARTM") and fully-"automatic" ("CLARTA")
processing approaches taken with the topics.

3.1   General  Summary  of  Official  Results 

Table 1 gives the official CLARIT-TREC-2 system
routing results as reported by NIST. A graph of the
precision-recall curves for the two sets of results is
given in Figure 2. The total number of relevant documents
retrieved under the routing task was 6,785 (CLARTM)
and 6,811 (CLARTA), representing, respectively, 64.69%
and 64.93% of the total known relevants (10,489).

Table 2 gives the official CLARIT-TREC-2 system
ad-hoc query results as reported by NIST. A graph
of the precision-recall curves for the two sets of
results is given in Figure 3. The total number of relevant
documents retrieved under the ad-hoc query task was 8,229
(CLARTM) and 8,109 (CLARTA), representing,
respectively, 76.30% and 75.19% of the total known relevants
(10,785).

The graph in Figure 4 shows the average precision
score for each process at N documents, for selected
values of N. It should be noted that the maximum possible
precision score at 500 and 1,000 documents is less than
100%. In particular, the average number of relevants
per routing topic is 209.78; this corresponds to a
maximum precision of 41.96% at 500 documents and 20.98%
at 1,000 documents. The average number of relevants
per ad-hoc query topic is 215.70; this corresponds to a
maximum precision of 43.14% at 500 documents and
21.57% at 1,000 documents.
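
The ceiling figures follow directly from the relevant-document totals in Tables 1 and 2. The short calculation below reproduces them; the function name is ours, and the bound is computed from the average number of relevants per topic, as in the text, rather than topic by topic.

```python
def max_precision(avg_relevant, cutoff):
    """Upper bound on precision at a cutoff, given the average relevants per topic."""
    return min(1.0, avg_relevant / cutoff)

routing_avg = 10489 / 50   # known relevants / routing topics = 209.78
adhoc_avg = 10785 / 50     # known relevants / ad-hoc topics  = 215.70

for label, avg in (("routing", routing_avg), ("ad-hoc", adhoc_avg)):
    for cutoff in (500, 1000):
        bound = round(100 * max_precision(avg, cutoff), 2)
        print(label, cutoff, bound)
```

At small cutoffs such as 5 or 30 documents the bound is 100%, which is why the caveat matters only for the 500 and 1,000 document columns.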

Tables 3 and 4 provide another view of total
performance. The numbers in each cell give the number
of times the CLARIT-TREC-2 system produced results
above, equal to, or below the median for all
TREC-participant systems. Numbers in brackets give the
instances of 'extreme' performance (best and worst)
among all systems. For the routing topics, for example,
CLARIT retrieval results at 1,000 documents were
better than the median 36 times in both "manual" and
"automatic" modes; CLARIT scored the maximum 10
and 11 times, respectively. For the ad-hoc query topics,
CLARIT retrieval results at 1,000 documents were
better than the median 44 times in "manual" mode and 42
times in "automatic" mode; CLARIT scored the
maximum 4 and 9 times, respectively.

3.2   CLARIT  Automatic  vs.  Manual  Modes  of 
Processing 

In both tasks (routing and ad-hoc querying),
CLARIT-TREC-2 automatic processing results are virtually
identical to manual results. This confirms our hypothesis
that the principal contribution to performance derives
from (1) the base-level CLARIT process (using
linguistic phrases as information units) and (2) the effect of
query augmentation via thesaural terms. On this
latter point, we note that, on average, the final query
vector for a topic will contain many more terms that
derive from thesaurus extraction than terms that
derive from the source topic. In general, then, when
reliable information is available (as in sample known
relevants or highly-likely relevants returned in a
first-pass retrieval), the CLARIT process will succeed in
finding good supplemental terminology for a topic
and the overall effects of manual intervention will be
minimized.¹

Figures 5 and 6 illustrate the relative absence of a
positive effect for manual intervention in the selection
and weighting of query terms. There are approximately
as many instances of decreased performance as there
are instances of increased performance. Most topics
show very little percentage difference in numbers of
documents returned;² this is especially underscored in
the results for routing topics at 1,000 documents.

4  Analysis 

4.1   CLARIT  Precision 

As in TREC-1, CLARIT precision-recall curves
demonstrate very high precision at low percentages of recall.
The first few documents returned by the system are
extremely likely to be relevant for the given topic. This
fact of CLARIT processing was successfully exploited
in the TREC-2 processing method: query augmentation
was possible because there was, in general, a good
concentration of topic-relevant information among the
sub-documents of the first-pass returned documents.

As  shown  in  Figure  4,  precision  remains  quite  stable 
for  all  methods  across  the  first  30  documents  retrieved 
and  is  relatively  high  across  the  full  retrieved  set  of 
1,000  documents. 

5  Query  Augmentation  Experiments 

A distinguishing feature of the CLARIT-TREC-2 system
is the use of fully-automatic query augmentation.
As noted above, the selection of terms for query
augmentation depends on (1) the selection of a source set of
known- or nominated-relevant documents and (2) the
application of the CLARIT thesaurus-discovery
procedure. Since the size (and quality) of the source set
of documents can vary and since CLARIT
thesaurus-discovery processing can be adjusted to nominate
relatively greater or fewer numbers of terms, the
'query-augmentation' facet of the CLARIT process is a natural
source of potential variation in system performance.


¹ Of course, there may be some forms of manual intervention (not
utilized in the CLARIT 'manual' process) that would have effects
dramatically better than the CLARIT automatic process. We know of
no such process that can be applied efficiently to arbitrary topics and
databases, however.

² Indeed, even in the absolute number of documents returned for
each topic (not shown in the figures) there is very little difference.

[Figure 2 plot. Title: CLARIT TREC-2: Routing Performance (Official
TREC-2 Results). Horizontal axis ticks: 0.0 to 1.0. Legend: Automatic
Routing at 1000 Documents, Average = .3269; Manual Routing at 1000
Documents, Average = .3302.]

Figure 2: P-R Curve for Routing Over 1000 Docs: CLARTM and CLARTA


Routing Queries, 51-100                       CLARTM     CLARTA

Total number of documents over all queries
    Retrieved:                                 50000      50000
    Relevant:                                  10489      10489
    Rel-ret:                                    6785       6811

Interpolated Recall-Precision Averages:
    at 0.00                                   0.7791     0.8101
    at 0.10                                   0.5825     0.5781
    at 0.20                                   0.4987     0.4941
    at 0.30                                   0.4505     0.4368
    at 0.40                                   0.3848     0.3851
    at 0.50                                   0.3295     0.3358
    at 0.60                                   0.2739     0.2749
    at 0.70                                   0.2109     0.2074
    at 0.80                                   0.1683     0.1561
    at 0.90                                   0.0951     0.0933
    at 1.00                                   0.0126     0.0141

Average Precision (non-interpolated)
    over all rel docs                         0.3302     0.3269

Precision:
    At 5 docs                                 0.6120     0.6200
    At 10 docs                                0.5940     0.6060
    At 15 docs                                0.5800     0.5893
    At 20 docs                                0.5700     0.5730
    At 30 docs                                0.5527     0.5460
    At 100 docs                               0.4396     0.4346
    At 200 docs                               0.3436     0.3416
    At 500 docs                               0.2176     0.2184
    At 1000 docs                              0.1357     0.1362

R-Precision (precision after R retrieved)
    Exact                                     0.3642     0.3646

Table 1: Official Results, Routing Topics 51-100
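The measures reported in these result tables can be illustrated for a single query with a minimal sketch. It assumes a ranked list of document identifiers and a set of known-relevant identifiers; the function names are illustrative and are not taken from the official evaluation software. The official figures are averages of such per-query values over the 50 topics.

```python
def average_precision(ranking, relevant):
    """Non-interpolated average precision over all relevant documents."""
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank  # precision at the rank of each relevant document
    # Relevant documents that are never retrieved contribute zero.
    return total / len(relevant) if relevant else 0.0


def interpolated_precision(ranking, relevant, levels=None):
    """Interpolated precision: the best precision at any recall >= each level."""
    if levels is None:
        levels = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0
    points = []  # (recall, precision) after each relevant document retrieved
    hits = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    return [
        max((p for r, p in points if r >= level - 1e-12), default=0.0)
        for level in levels
    ]


def r_precision(ranking, relevant):
    """Precision after R documents are retrieved, R = number of relevant documents."""
    r = len(relevant)
    return sum(1 for doc in ranking[:r] if doc in relevant) / r if r else 0.0
```

For a ranking d1..d5 with relevant documents d1, d3, and an unretrieved d6, the average precision is (1/1 + 2/3) / 3 and the R-precision is 2/3.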
[Figure: Official TREC-2 Results. CLARIT TREC-2: Ad-Hoc Performance (precision-recall curves; horizontal axis: Recall). Automatic Ad-Hoc at 1000 Documents, Average = .3264; Manual Ad-Hoc at 1000 Documents, Average = .3383]

Figure 3: P-R Curve for Ad-Hoc Queries Over 1000 Docs: CLARTM and CLARTA


Ad-Hoc Queries, 101-150                       CLARTM     CLARTA

Total number of documents over all queries
    Retrieved:                                 50000      50000
    Relevant:                                  10785      10785
    Rel-ret:                                    8229       8109

Interpolated Recall-Precision Averages:
    at 0.00                                   0.7455     0.7587
    at 0.10                                   0.5811     0.5676
    at 0.20                                   0.5240     0.5038
    at 0.30                                   0.4622     0.4489
    at 0.40                                   0.4135     0.3938
    at 0.50                                   0.3593     0.3455
    at 0.60                                   0.2983     0.2806
    at 0.70                                   0.2312     0.2172
    at 0.80                                   0.1516     0.1463
    at 0.90                                   0.0874     0.0827
    at 1.00                                   0.0062     0.0070

Average Precision (non-interpolated)
    over all rel docs                         0.3383     0.3264

Precision:
    At 5 docs                                 0.5840     0.5800
    At 10 docs                                0.5740     0.5680
    At 15 docs                                0.5640     0.5547
    At 20 docs                                0.5590     0.5490
    At 30 docs                                0.5433     0.5253
    At 100 docs                               0.4846     0.4644
    At 200 docs                               0.3975     0.3874
    At 500 docs                               0.2601     0.2542
    At 1000 docs                              0.1646     0.1622

R-Precision (precision after R retrieved)
    Exact                                     0.3741     0.3645

Table 2: Official Results, Ad-Hoc Topics 101-150
[Figure: Official TREC-2 Results. Precision at N Docs, plotted against Number of Documents, for Manual Routing, Automatic Routing, Manual Ad-Hoc, and Automatic Ad-Hoc]

Figure 4: Comparative Precision at N Documents


                       > Median             = Median             < Median
                    CLARTM    CLARTA     CLARTM    CLARTA     CLARTM    CLARTA
Precision           37 [1]    34 [0]     0         4          13 [0]    12 [0]
Rels in Top 100     32 [2]    29 [2]     6         8          12 [1]    13 [1]
Rels in Top 1000    36 [10]   36 [11]    5         6          9 [0]     8 [0]

Table 3: Summary of Results for Routing (51-100)


                       > Median             = Median             < Median
                    CLARTM    CLARTA     CLARTM    CLARTA     CLARTM    CLARTA
Precision           38 [0]    34 [3]     4         7          8 [0]     9 [0]
Rels in Top 100     39 [1]    31 [0]     3         2          9 [0]     17 [0]
Rels in Top 1000    44 [4]    42 [9]     2         2          4 [0]     6 [0]

Table 4: Summary of Results for Ad-Hoc Queries (101-150)
Using TREC-2 processing results as a baseline, we have begun
exploring several of the parameters of query augmentation in the
CLARIT-TREC-2 system. The first set of experiments focuses on three
parameters:

• The use of query augmentation compared to a procedure involving no
query augmentation;

• The size of relevant document samples as source collections for
thesaurus extraction; and

• The 'threshold' chosen for the thesaurus discovery process.

5.1   Query  Augmentation  vs.  No  Augmentation 

In order to verify that CLARIT query augmentation techniques do have
a positive effect on retrieval results, we have compared the official
TREC-2 results against experimental runs that do not take advantage
of any augmentation. For the routing queries, we re-ran the routing
task using only the raw queries as vectors (consisting of terms from
the topic statement only). The effects for manual and automatic modes
are shown in Figure 7. In the case of the ad-hoc queries, the
unaugmented results are simply the results of the initial round of
querying, before automatic feedback has taken place. The effects for
both modes are shown in Figure 8.

Query augmentation using known-relevant documents, as in the routing
experiment, has a dramatic effect on system performance. The manual
routing task showed a 69% improvement with augmentation, while
automatic routing improved by 76%. Figure 9 shows the effect of query
augmentation for individual routing topics, calculated as the
difference in average precision between augmented and unaugmented
queries. Note that very few queries do not improve with augmentation;
most show great improvement.

Query augmentation is not as effective when used for automatic
feedback in processing ad-hoc queries, but it still results in
substantial improvements. For manual ad-hoc queries, we found a 21%
improvement when using query augmentation; for automatic ad-hoc
queries, a 22% improvement. Obviously, the use of known-relevant
documents in the case of routing topics has an impact on query
augmentation, but it is not the only important factor. The
query-by-query effect of augmentation for ad-hoc queries is shown in
Figure 10. While the effect is not as great as with routing topics,
most queries show improvement with augmentation. By comparing the
query-by-query results for automatic and manual modes, one can see
that manual intervention in query formulation improves the results of
augmentation in specific cases, such as query 121, but the positive
effect is difficult to predict. Even without user review, we can have
confidence in the ability of query augmentation to improve query
results, given reasonably accurate initial query formulation.

We believe that the techniques we used in CLARIT-TREC-2 automatic
feedback can be refined to give better documents as input to query
augmentation. First, due to engineering and time constraints in
TREC-2 processing, we did not select the best sub-documents from the
entire corpus. Instead, we selected the best sub-documents from a
pool of the best relevant documents, which might not always
correspond to the optimal set of sub-documents for query
augmentation. Second, the TREC-2 process used an absolute number of
sub-documents (the top N) in query augmentation, regardless of the
'similarity' scores of those sub-documents to the source query. While
it may not be possible to determine 'absolute' relevance on a
query-by-query basis, minimum thresholds might be applied to exclude
clearly irrelevant sub-documents, should such be within the N
otherwise to be used in augmentation.
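The minimum-threshold idea can be sketched as follows. This is only an illustration under stated assumptions: the function name, the `(doc_id, score)` pair layout, and the `min_score` parameter are hypothetical, and the sketch says nothing about how CLARIT computes similarity scores.

```python
def select_subdocuments(scored, n, min_score=0.0):
    """Return up to n (doc_id, score) pairs, best first.

    Pairs among the top n whose score falls below min_score are dropped,
    so fewer than n sub-documents may be returned.
    """
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return [(doc_id, score) for doc_id, score in ranked[:n] if score >= min_score]
```

With `min_score=0.0` this reduces to the plain top-N selection; raising it excludes low-scoring sub-documents that would otherwise fall within the top N.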

5.2    Source  Sample  Size 

In the case of the submitted results of CLARIT-TREC-2 processing, the
thesaurus for each routing query was extracted from the set of all
its known-relevant training documents. However, we can imagine that
many such documents contain some information, perhaps a great deal,
that is not relevant to the topic at hand. (This is especially
expected with long documents.) Therefore, we have experimented with
alternative approaches to sampling text from known-relevant
documents. In particular, we have used a ranking of sub-documents
(paragraphs) to nominate candidate 'good' relevant texts and have
used variable numbers of sub-documents as source text in thesaurus
extraction.

In practice, to select source text for the thesaurus-discovery
process for a topic, we run the raw topic vector as a query over the
collection of relevant documents, which are partitioned into
sub-documents, and return only the top N relevant sub-documents. (In
cases where the topic has fewer than N sub-documents in the
collection of relevants, all sub-documents are selected.) Our
hypothesis is that thesauri extracted from relevant sub-documents
will contain greater numbers of 'true-positive' terms related to the
topic. We have run the experiment for N = 100, 200, 300, 500, and
1000 as the cutoff for selected sub-documents. Figure 11 gives a
sample of the results for routing topics, including the best case of
500 sub-documents.
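The selection procedure just described can be sketched roughly as below. The sketch makes several assumptions that are not from the paper: paragraphs are split on blank lines, terms are lower-cased word tokens rather than CLARIT noun phrases, and plain term-frequency cosine similarity stands in for the CLARIT scoring. All names are hypothetical.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Bag-of-words term-frequency vector (a stand-in for CLARIT term vectors)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


def top_subdocuments(topic_text, relevant_docs, n):
    """Rank all paragraphs of the relevant documents against the raw topic.

    relevant_docs maps a document id to its text. Returns up to n tuples of
    (score, doc_id, paragraph_index, paragraph_text), best first. If fewer
    than n paragraphs exist, all of them are returned.
    """
    query = term_vector(topic_text)
    scored = []
    for doc_id, text in relevant_docs.items():
        paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
        for index, paragraph in enumerate(paragraphs):
            scored.append((cosine(query, term_vector(paragraph)), doc_id, index, paragraph))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:n]
```

The returned paragraphs would then serve as the source text for thesaurus extraction in place of the full relevant documents.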

While the differences for different N are not great, certain trends
are evident. First, it is clear that using only the more similar
sub-documents from a collection of relevants gives better results
than using all relevant full documents. (This represents an important
improvement over the method employed in the CLARIT-TREC-2 system.)
Second, larger sample sizes seem to perform better than smaller ones.
In the experiments that we ran, this effect peaks at 500
sub-documents, but such results probably interact with the average
number of relevant sub-documents available. As noted previously, we
would like ideally to use a variable number of sub-documents for each
query, but such an approach requires an accurate measure of
'absolute' relevance.

5.3   Thesaurus  Size 

CLARIT thesaurus discovery techniques allow for the selection
(sampling) of greater or fewer numbers of terms from a collection. In
practice, the size of the sample is determined by a real-number
threshold between 0.00 and 1.00. A larger threshold results in the
inclusion of more general terminology from the document collection; a
smaller threshold results in selection of terms that are more
specific to the collection. (Intuitively, such variation correlates
with the 'breadth' or 'narrowness' of the thesaurus.) The set of
terms selected at a larger threshold will always properly include all
of the terms that would be selected from the same collection at a
smaller threshold. In CLARIT-TREC-2 processing, the threshold was set
at 0.50. Note, however, that in the document-sample-size experiments
reported above, we used a threshold of 0.75.
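The nesting property of the threshold can be illustrated with a deliberately simplified sketch. It assumes, purely for illustration, that candidate terms are already ranked from most to least collection-specific and that the threshold selects a leading fraction of that ranking; the paper does not state how CLARIT actually maps the threshold to a term set, and the function name is hypothetical.

```python
import math


def select_thesaurus_terms(ranked_terms, threshold):
    """Keep the leading `threshold` fraction of a ranked term list.

    ranked_terms is assumed to be ordered from most to least
    collection-specific. Because a prefix is taken, the terms selected at
    a smaller threshold are always contained in those selected at a
    larger one.
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must lie between 0.00 and 1.00")
    return ranked_terms[: math.ceil(threshold * len(ranked_terms))]
```

Under this assumption a 0.75 thesaurus contains every term of the 0.50 thesaurus plus additional, more general terms.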

Using sub-document samples of relevant documents with N = 300
(reflecting the 'second-best' performance obtained in the experiments
on sample size), we have explored the effects of using different
thesaurus extraction thresholds: at 0.50, 0.75, 0.85, and 0.95
levels. Results for routing topics are given in Figure 12.

As with sample-size variation, differences in performance between
increments are not dramatic, but trends do appear. First, the 0.50
thesaurus is clearly inferior and actually performs very much like
the baseline CLARIT-TREC-2 system. Such a result indicates that much
of the variation observed between baseline CLARIT-TREC-2 processing
and our current 'best' technique is due to changes in thesaurus
thresholds, rather than changes in the document sample size. Second,
while the 0.75, 0.85, and 0.95 thesauri all have similar precision at
low recall levels, the 0.75 thesaurus performs slightly better in the
average case. From this we might hypothesize that the 0.75 value is
close to the optimal threshold for thesaurus discovery in the context
of query augmentation.

Figure 13 gives the results for the automatic routing task for the
optimal number of sub-documents and thesaurus threshold, compared to
the TREC-2 reported results and to the unaugmented baseline. Here we
can see that a simple refinement to parameter setting yields a 5.5%
overall improvement in average precision and 9.1%, 8.8%, and 7.6%
improvement in precision at 10%, 20%, and 30% recall levels,
respectively. In addition, there is a 6.3% improvement in total
relevant documents (7,241 vs. 6,811). Furthermore, our experiments
suggest mechanisms, such as measures of absolute relevance, that
might result in further significant gains.
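The percentage figures quoted here are relative improvements over the baseline. As a small check of the arithmetic for the relevant-document counts (the function name is illustrative only):

```python
def relative_improvement(new, baseline):
    """Percentage change of `new` relative to `baseline`."""
    return 100.0 * (new - baseline) / baseline


# 7,241 relevant documents retrieved versus 6,811 for the TREC-2 automatic routing run
gain = relative_improvement(7241, 6811)
```

Here `gain` is 430 / 6811 expressed as a percentage, which rounds to the 6.3% reported in the text.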

6  Conclusion 

The CLARIT-TREC-2 system has successfully demonstrated the ability to
operate as a fully-automatic IR system. Since the performance
differences between CLARIT manual and automatic processing modes are
negligible, one can use CLARIT in fully-automatic mode and expect
high precision and very good recall on retrieval tasks.

The TREC-2 results also demonstrate the efficacy of the CLARIT
technique of automatic query augmentation. It is generally difficult
for a user to predict whether the addition of terms to a query will
have a positive or negative effect on performance. CLARIT query
augmentation, using CLARIT thesaurus-discovery techniques, however,
shows positive effects. Because the technique is fully automatic, it
can be applied either at the time of query formulation (if exemplary
relevant texts are known) or at the time of 'first-pass' retrieval.
In either case, final results will be improved.

In several experiments, we have already identified two simple
adjustments to CLARIT parameters that will improve performance beyond
the CLARIT-TREC-2 system baseline. The system is not yet optimized;
we expect to make other straightforward improvements.

Many text processing functions currently available in the CLARIT
system or near completion were not used on TREC-2 documents. In
future evaluations, we plan to utilize some of the more sophisticated
functionality in the system. For example, we have been developing
grammars for recognizing complex tokens such as proper names, dates,
times, monetary values, etc., but did not use token recognition
modules in CLARIT-TREC processing. We believe that such token
recognition will improve the results for queries involving specific
persons or time intervals. Finally, we have also been experimenting
with generating sub-corpus-derived equivalence classes for words and
terms. We expect to use equivalence classes selectively to supplement
thesaurus terms in query augmentation.

In sum, we believe that CLARIT-TREC-2 processing results demonstrate
the power of CLARIT tools to solve IR tasks. The CLARIT-TREC-2 system
represents only one of many possible configurations of CLARIT
modules. In subsequent work, we plan on exploring other
configurations.
[Figure: Effect of Query Augmentation: Routing Queries]

Figure 8: Effect of Query Augmentation on Ad-Hoc Queries

Figure 11: Effect of Sample Size (Routing Topics)

Figure 12: Effect of Thesaurus Size (Routing Topics)
1 Introduction

Information  retrieval  can  be  viewed  as  an  evidential 
reasoning  problem.  Given  a  representation  of  a  document 
(e.g.,  the  presence  or  absence  of  selected  words  and 
phrases),  and  a  representation  of  an  information  need  (e.g., 
topics  of  interest),  the  problem  of  information  retrieval  is  to 
infer  the  degree  to  which  the  document  matches  the 
information  need.  Since  probability  theory  is  the  classical 
choice  for  automating  evidential  reasoning,  probabilistic 
approaches  to  information  retrieval  are  natural  and  have 
had  a  long  history,  starting  in  the  1960's  (Maron  &  Kuhns, 
1960). 

In  this  paper  we  describe  research  that  adapts  and  applies 
Bayesian  networks,  a  new  technology  for  probabilistic 
representation  and  inference,  to  information  retrieval.  The 
technology  has  substantial  advantages  over  older 
technologies  including  an  intuitive  representation  and  a  set 
of  efficient  inference  algorithms.  We  discuss  the  Bayesian 
network  technology  and  probabilistic  information  retrieval 
in  Section  2  of  this  paper. 

Our  research  is  directed  at  developing  a  probabilistic 
information  retrieval  architecture  that: 

• is oriented towards assisting users that have stable information
needs in routing (i.e., sorting through) large amounts of
time-sensitive material,

• gives users an intuitive language with which to specify their
information needs,

• requires modest computational resources (i.e., memory and CPU
speed), and

• can integrate relevance feedback and training data with users'
judgements to incrementally improve retrieval performance.

Towards  these  goals,  we  have  developed  a  system  that 
allows  a  user  to  specify:  multiple  topics  of  interest  (i.e., 
information  needs),  qualitative  and  quantitative 
relationships  between  the  topics,  document  features  that 
relate  to  the  topics,  and  quantitative  relationships  between 
these  features  and  the  topics.  The  system  runs  on  a 
Macintosh II computer and can use training data to estimate
any  of  the  quantitative  values  in  the  system.  We  discuss  the 
particular  methods  we  developed  and  used  in  our  system  in 
Section  3. 


We  participated  in  the  exploratory  group  (Category  B)  of 
the  1993  Text  Retrieval  Conference  (TREC-2),  sponsored 
by  the  National  Institute  of  Standards  and  Technology 
(NIST).  As  a  participant  in  the  exploratory  group,  we  were 
tasked  with  working  with  a  subset  of  the  TREC-2  training 
and  test  data.  Our  training  data  consisted  of  Wall  Street 
Journal  (WSJ)  articles  and  our  test  data  consisted  of  San 
Jose Mercury News (SJMN) articles. We chose a subset of
10  topics  out  of  the  50  TREC-2  routing  topics  to  best 
illustrate  the  methods  and  concepts  we  developed.  The 
choice  of  the  10  topics  was  reported  to  the  TREC 
coordinators  prior  to  our  training  runs  and,  of  course,  prior 
to  our  receipt  of  the  test  data.  We  generated  routing  queries 
for  each  of  the  10  chosen  topics,  trained  against  the  WSJ 
training  set  to  improve  our  queries,  and  tested  these  queries 
against  the  SJMN  articles  in  the  test  data  set. 

Our  system  was  developed  entirely  within  the  duration  of 
the  TREC-2  project  (January  93  to  June  93)  including  the 
document  handling,  feature  extraction,  inference,  and 
reporting  capabilities.  Our  TREC-2  effort  consisted  of  the 
two  authors.  We  describe  the  experimental  set-up  in 
Section  4  and  the  result  of  our  test  run  in  Section  5. 

We  are  very  encouraged  by  the  test  results  and  have  many 
ideas  for  future  research,  which  we  discuss  in  Section  6. 

2  Background 

In  this  section  we  describe  the  Bayesian  network 
technology  and  outline  the  previous  efforts  in  probabilistic 
information  retrieval. 

2.1  Bayesian  Networks 

While  probability  theory  provides  a  suitable  theoretical 
foundation  for  evidential  reasoning,  a  technology  based  on 
probability  theory  that  is  computationally  tractable  and  that 
includes  an  effective  methodology  for  acquiring  the  needed 
probabilistic  information  has  been  lacking.  Recent 
developments  in  Bayesian  networks  have  provided  these 
features.  As  the  name  suggests,  the  technology  is  based  on 
a  network  representation  of  probabilistic  information 
(Howard  &  Matheson,  1981;  Pearl,  1988). 

A  Bayesian  network  represents  beliefs  and  knowledge 
about a particular class of situations. The use of Bayesian
networks is similar to that of expert system technologies. Given a
Bayesian  network  (i.e.,  a  knowledge  base)  for  a  class  of 
situations  and  evidence  (i.e.,  facts)  about  a  particular 
situation  of  that  class,  conclusions  can  be  drawn  about  that 
situation.  The  technology  has  been  used  in  a  wide  variety 
of  situations,  including  medical  diagnosis,  military  situation 
assessment,  and  machine  vision. 

A  Bayesian  network  is  a  directed  acyclic  graph  where  each 
node  represents  a  random  variable  (i.e.,  a  set  of  mutually 
exclusive  and  collectively  exhaustive  propositions).  Each 
set  of  arcs  into  a  node  represents  a  probabilistic  dependence 
between  the  node  and  its  predecessors  (the  nodes  at  the 
other  end  of  arcs).  The  primary  technical  innovation  of 
Bayesian  networks  is  the  representation,  through  the 
network's  structure,  of  conditional  independence  relations 
between  the  variables  in  a  network.  This  innovation  not 
only  provides  an  intuitive  representation  to  acquire 
probabilistic  information  but  also  renders  inference 
tractable  for  large  numbers  of  real-world  situations. 

Inferences  can  be  drawn  from  a  Bayesian  network  with  a 
wide  variety  of  algorithms.  There  are  exact  algorithms 
(Lauritzen  &  Spiegelhalter,  1988;  Shachter,  1986;  Shachter, 
D'Ambrosio,  &  Del  Favero,  1990)  as  well  as  approximate 
algorithms (Fung & Chang, 1989). Inference algorithms
compute  a  probability  distribution  of  some  combination  of 
variables  in  the  network  given  all  the  evidence  represented 
in  the  network. 

Since  the  introduction  of  the  Bayesian  network  technology, 
several  efforts  have  been  made  to  apply  it  to  information 
retrieval  (Fung,  Crawford,  Appelbaum,  &  Tong,  1990; 
Turtle  &  Croft,  1990).  The  results  are  promising. 
However,  several  recent  innovations  have  been  made  in  the 
Bayesian  network  technology  that  have  not  yet  been 
applied  to  information  retrieval.  The  innovations  involve 
the  representation  of  conditional  independence  relations 
that  are  finer  than  those  currently  represented  at  the 
network  level.  One  of  these  innovations  is  Similarity 
Networks  (Heckerman,  1991)  developed  by  David 
Heckerman  at  Stanford  University.  Another  innovation  for 
representing the relationship between variables, the
"co-occurrence diagram," was developed on this project.

2.3  Probabilistic  Information  Retrieval 

Most probabilistic information retrieval techniques can be
illustrated graphically through the use of Bayesian networks. In a
simple formulation, there are $n$ topics of interest
($t_1, \ldots, t_n$) and $m$ identifiable document features
($f_1, \ldots, f_m$). The information retrieval problem is to compute
the posterior probability $p(t_i \mid f_1, \ldots, f_m)$ for each
topic, given a quantification (i.e., the joint probability
$p(t_1, \ldots, t_n)$) of the frequency that the topics appear in the
corpus and a quantification (i.e., the conditional probability
$p(f_1, \ldots, f_m \mid t_1, \ldots, t_n)$) of the relationship of
the "presence" of a topic in a document and the "presence" of
features.


Figure 2.1 is a Bayesian network representing a retrieval model with
one topic of interest and three features. The node $t$ represents the
events "the document is relevant to topic $t$" and "the document is
not relevant to topic $t$." The nodes $f_i$ represent the events "the
feature $f_i$ is present in the document" and "the feature $f_i$ is
not present (is absent) in the document." The prior probabilities of
the events $t$ and not-$t$ are stored in the $t$ node. The
conditional probabilities $p(f_i \mid t)$ are stored in each of the
feature nodes.


Figure  2.1:  The  two-level  Bayesian  network  model  of 
information  retrieval 

Because  of  the  lack  of  arcs  between  the  feature  nodes,  this 
diagram  embodies  the  assumption,  used  in  many 
probabilistic  systems,  that  the  features  are  conditionally 
independent  of  each  other  given  the  topic. 

Using  the  Bayesian  network  form  of  Bayes'  rule  called  arc 
reversal  (Shachter,  1986),  the  posterior  probability  of  the 
event  t  can  be  computed.  The  inversion  formulas  are 
straightforward  and  computationally  feasible  (they  are 
described  in  the  Appendix).  Figure  2.2  shows  the  network 
in  Figure  2.1  with  the  arcs  reversed.  It  represents  a  model 
in  which  knowing  whether  one  or  more  of  the  features  are 
present  provides  information  (i.e.,  the  posterior  distribution 
on  node  t)  about  whether  the  topic  is  relevant  to  the 
document. 


Figure  2.2:  The  Information  retrieval  model  after 
Bayesian  inversion 
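For the single-topic, two-level model, the posterior can be sketched with ordinary Bayes' rule under the conditional-independence assumption: $p(t \mid f_1, \ldots, f_m) \propto p(t) \prod_i p(f_i \mid t)$, normalized against the same product for not-$t$. The code below is a minimal sketch of that calculation, not a transcription of the inversion formulas in the paper's Appendix; the function and parameter names are hypothetical.

```python
def posterior_relevance(prior, p_feat_given_rel, p_feat_given_not, observed):
    """Posterior probability that a document is relevant to the topic.

    prior               -- p(t), prior probability of relevance
    p_feat_given_rel[i] -- p(f_i present | t)
    p_feat_given_not[i] -- p(f_i present | not t)
    observed[i]         -- True if feature f_i is present in the document
    Features are assumed conditionally independent given the topic.
    """
    weight_rel = prior
    weight_not = 1.0 - prior
    for p_rel, p_not, present in zip(p_feat_given_rel, p_feat_given_not, observed):
        weight_rel *= p_rel if present else 1.0 - p_rel
        weight_not *= p_not if present else 1.0 - p_not
    total = weight_rel + weight_not
    return weight_rel / total if total else 0.0
```

For example, with a prior of 0.1, one present feature with probabilities 0.8 (relevant) and 0.1 (not relevant), and one absent feature that is uninformative (0.5 in both cases), the posterior is 0.04 / (0.04 + 0.045).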

To  address  multiple  topics  within  this  framework,  a  model 
similar  to  the  one  represented  by  Figure  2.1  would  be 
generated  for  each  topic,  and  inference  would  be  carried  out 
on  each  topic  separately.  However,  these  multiple  models 
fail to represent possible relationships between the topics.
As  a  consequence,  the  acquisition  of  consistent  feature- 
topic  probabilities  and  the  interpretation  and  comparison  of 
multiple  topic  probabilities  are  problematic.  For  example, 
it  would  be  impossible  using  this  framework  to  compute  the 
probability  that  a  document  was  relevant  to  two  selected 
topics  or  to  compute  the  probability  that  a  document  was 
relevant  to  at  least  one  of  two  selected  topics. 

Bayesian networks can be used to explicitly represent the
relationship between topics. For example, consider Figure 2.3 in
which the two topics $t_1$ and $t_2$ are related.


Figure  2.3:  Information  Retrieval  Model  with  Two 
Related  Topics 

The  Bayesian  inference  problem  with  multiple  topics 
becomes  more  complicated  than  before.  To  compute  the 
posterior  probability  of  each  topic,  all  the  other  topics  must 
be  removed  through  marginalization  as  well  as  reversing 
the  topic-feature  arcs  as  before.  In  addition,  any  joint 
distribution  between  the  topics  can  be  computed.  To 
perform  these  computations,  a  general  inference  algorithm 
is  needed. 
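Once a joint posterior distribution over topic combinations is available, the quantities mentioned above follow by summing over states. The sketch below assumes the joint distribution is stored as a dictionary keyed by tuples of booleans, one per topic; this layout and the function names are illustrative assumptions, and a general inference algorithm would still be needed to produce the joint distribution in the first place.

```python
def marginal(joint, index):
    """p(topic `index` is relevant), marginalizing out all other topics."""
    return sum(p for state, p in joint.items() if state[index])


def prob_all(joint, indices):
    """Probability that the document is relevant to every listed topic."""
    return sum(p for state, p in joint.items() if all(state[i] for i in indices))


def prob_any(joint, indices):
    """Probability that the document is relevant to at least one listed topic."""
    return sum(p for state, p in joint.items() if any(state[i] for i in indices))
```

These are exactly the two-topic questions that separate single-topic models cannot answer: relevance to both topics and relevance to at least one.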

However, there is a way to simplify a multiple-topic network to look
like a single-topic network (Figure 2.1). The topics can be combined
into a compound topic (node s in Figure 2.4) (Chang & Fung, 1989)
whose range is all possible present-or-absent combinations of $t_1$
and $t_2$. An advantage of this representation is that the same
simple computational formulas can be used as before.

A disadvantage of this representation is that the intermediate query,
s, must contain (in the worst case) $2^n$ states, representing all
possible present-or-absent combinations of its n parent topics.
However, this worst case is rarely seen, because the relationships
between topics are typically sparse. Building this intermediate query
is one of the innovations of our research and is described in
Section 3.2.


Figure 2.4: Multiple-topic query represented as a single
compound-topic query
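To make the state-count argument concrete, the following sketch enumerates the states of a compound topic and discards combinations ruled out by known relationships between topics. It is an assumption-laden illustration, not the authors' construction: only two kinds of constraint (mutual exclusion and subset) are modeled, and the function and parameter names are hypothetical.

```python
from itertools import product


def compound_states(n_topics, exclusive=(), subsets=()):
    """Enumerate feasible present-or-absent combinations of n topics.

    exclusive -- pairs (i, j) of topics that cannot both be present
    subsets   -- pairs (i, j) meaning topic i is a subset of topic j,
                 so i present implies j present
    With no constraints, all 2**n_topics combinations are returned.
    """
    states = []
    for state in product((False, True), repeat=n_topics):
        if any(state[i] and state[j] for i, j in exclusive):
            continue
        if any(state[i] and not state[j] for i, j in subsets):
            continue
        states.append(state)
    return states
```

For three pairwise mutually exclusive topics, only 4 of the 8 possible states survive (none present, or exactly one present).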


3        Retrieval  Architecture 

This  section  describes  the  information  retrieval  architecture 
we  have  developed. 

The inputs to the system are the descriptions of topics of interest
and a set of documents to test. The output of the system is a list of
documents ranked by their degree of relevance to each topic. The
system computes the degree of relevance as the probability that a
document is relevant to a topic, for every document-topic pair. To
help in this task, the system is given a training set of documents to
which relevance judgements have been attached.

There  are  several  components  to  our  retrieval  system.  The 
feature  extraction  component,  described  in  Section  3.1, 
translates  a  document  from  its  raw  text  form  to  a  simpler 
internal  representation  that  the  system  can  use.  The  query 
construction  component,  described  in  Section  3.2,  translates 
the  description  of  a  set  of  topics  into  an  internal 
representation.  The  document  scoring  component, 
described  in  Section  3.3,  uses  Bayesian  inference  to 
calculate  a  measure  of  a  document's  relevance  to  a  topic, 
given  the  internal  representations  of  both  document  and 
topic. 

3.1  Feature  Selection  and  Extraction 

Any  observable  characteristic  of  the  text  of  a  document  that 
may  be  clearly  defined  (e.g.,  as  either  present  or  absent) 
may  be  regarded  as  a  feature.  Our  system  looks  for  two 
types  of  features  in  the  text  of  a  document:  single  words 
and  pairs  of  adjacent  single  words.  If  a  feature  appears  at 
least  once  in  a  document,  it  is  counted  as  "present." 
Otherwise,  it  is  counted  as  "absent." 

The  internal  representation  of  a  document  is  therefore  a 
binary  vector,  each  element  indicating  the  presence  or 
absence  of  the  corresponding  feature  in  the  document. 

The system removes many common suffixes from a word using Paice's
stemming rules (Paice, 1977). This means, for example, that if either
of the words "walking" or "walks" is present in a document, the
system considers the root word "walk" to be present in the document.
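The document representation described in this section can be sketched as follows. The `crude_stem` helper is a hypothetical stand-in that strips a few suffixes; it is not an implementation of Paice's rules. The tokenization and the treatment of adjacency across punctuation are likewise assumptions made for the sake of a short example.

```python
import re


def crude_stem(word):
    """Hypothetical stand-in for Paice's stemming rules: strip a few suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def extract_features(text, target_features):
    """Binary vector marking which target features occur in the text.

    Features are stemmed single words or pairs of adjacent stemmed words
    (written with a single space between them). A feature is "present"
    if it occurs at least once.
    """
    words = [crude_stem(w) for w in re.findall(r"[a-z]+", text.lower())]
    present = set(words)
    present.update(f"{a} {b}" for a, b in zip(words, words[1:]))
    return [1 if feature in present else 0 for feature in target_features]
```

For instance, the text "She walks. He is walking home." yields the vector [1, 1, 0] for the target features "walk", "walk home", and "run".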

The  system  must  be  given  a  list  of  features  for  which  to 
look.  This  target  list  was  constructed  in  three  steps.  First,  a 
list  of  candidate  features  was  generated  from  the 
descriptions  of  the  50  TREC-2  routing  topics.  The  text  of 
the  descriptions  was  broken  up  into  individual  words,  from 
which  the  suffixes  (if  any)  were  removed.  Duplicate  words 
and  common  words  (stop  words)  were  removed.  A  number 
of  two-word  features  (such  as  phrases  and  proper  names) 
were  identified  by  hand.  This  procedure  created  a  list  of 
about  1400  candidate  features. 

Second, the system extracted the relative frequency information for
each of these features from the training documents. Then, for each
topic (the 10 plus an additional topic representing "none of the
above"), the system sorted the candidate features in descending order
according to the F4 formula of Robertson and Sparck Jones (Robertson
& Sparck Jones, 1976), which we used as a measure of the ability of a
feature to characterize a topic.

Third,  the  top  30  features  for  each  topic  (and  the  top  60 
features  for  "none  of  the  above")  were  combined  into  a 
single  list.  After  removing  duplicates,  this  yielded  the  final 
list  of  229  features. 
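
The ranking and merging steps can be sketched as below. The weight is written in the commonly used form of the Robertson and Sparck Jones relevance weight with a 0.5 correction; whether the original system used exactly this correction is an assumption, as are the function names and the input layout. For simplicity the sketch uses one `top_k` for every topic, whereas the text uses 30 per topic and 60 for "none of the above."

```python
import math

def f4_weight(r, n, R, N):
    """Relevance weight of a feature for one topic.

    r: relevant documents containing the feature
    n: all documents containing the feature
    R: relevant documents, N: all documents
    """
    return math.log(((r + 0.5) / (R - r + 0.5)) /
                    ((n - r + 0.5) / (N - n - R + r + 0.5)))

def select_features(counts, relevant_by_topic, total_docs, top_k):
    """counts[topic][feature] = (r, n). Merge the top_k features of every topic."""
    selected = []
    for topic, feats in counts.items():
        ranked = sorted(
            feats,
            key=lambda f: f4_weight(*feats[f], relevant_by_topic[topic], total_docs),
            reverse=True,
        )
        for feature in ranked[:top_k]:
            if feature not in selected:  # drop duplicates across topics
                selected.append(feature)
    return selected
```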

We  tried  numerous  feature  selection  strategies  and  settled 
on  this  one  as  the  most  satisfactory. 

3.2  Query  Representation  and  Construction 

In  preparation  for  the  inference  step,  the  10  single-topic 
queries  are  combined  into  one  multiple-topic  query.  As 
mentioned  in  Section  2.3,  this  aggregation  can  in  the  worst 
case require 2^10 different states in the multiple-topic query.
In  this  section,  we  describe  how  to  reduce  the  number  of 
states  required  by  considering  the  relationships  between  the 
topics. 

Once  the  states  of  the  query  have  been  structured,  we  must 
assign  numerical  values  to  the  Bayesian  model.  We 
describe  how  to  generate  estimates  of  the  prior  probabilities 
of  each  of  these  states  in  Section  3.2.3.  We  describe  in 
Section  3.2.4  how  to  define,  for  each  feature  and  state,  the 
conditional  probability  that  the  feature  appears  in  a 
document  given  that  the  document  is  relevant  to  that  state. 


3.2.1  Pairwise  Relationships  Between  Topics 

There are six possible pairwise relationships between any
two topics t1 and t2:

1. t1 and t2 are mutually exclusive if there is no
document relevant to both topics

2. t1 is a subset of t2 if all documents relevant to t1 are
also relevant to t2

3. t2 is a subset of t1 if all documents relevant to t2 are
also relevant to t1

4. t1 and t2 are equivalent if t1 and t2 are subsets of
each other (they satisfy Relations 2 and 3)

5. t1 and t2 are dependent if knowing that a document
is relevant to t1 gives you some information about
whether the document is relevant to t2

6. t1 and t2 are independent if knowing that a
document is relevant to t1 gives you no information
on whether the document is relevant to t2

Each  of  the  Relations  1-5  is  a  type  of  dependence  between 
topics.  In  a  belief  network,  topics  satisfying  Relations  1-5 
would  be  connected  by  an  arc.  To  ensure  that  two  topics 
satisfy  only  one  of  the  relations,  Relation  5  (dependence)  is 
defined  as  any  type  of  dependence  that  is  different  from 
those  of  Relations  1-4. 

Relation  6,  independence,  is  represented  in  a  belief  network 
by  the  absence  of  an  arc  between  the  topics. 

The  distinction  between  dependence  (Relation  5)  and 
independence  (Relation  6)  is  useful  in  calculating  the 
probabilities  of  combinations  of  the  topics,  as  described  in 
Section  3.2.3. 

Relations  1-4  can  be  identified  by  testing  whether  the 
defining conditions are satisfied in the training set. If the
topics  satisfy  none  of  the  first  four  relations,  then  a  chi- 
square  test  can  distinguish  between  Relations  5  and  6.  If 
there  are  too  few  documents  with  relevance  judgements  for 
both  topics  to  make  reliable  conclusions  (our  cutoff  was  13 
documents),  the  system  makes  an  assessment,  then  prompts 
the  user  to  verify  it  manually. 
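
A minimal sketch of this classification procedure follows. The input layout, the function name, and the 5% significance level (critical value 3.841 for one degree of freedom) are assumptions; only the 13-document cutoff comes from the text.

```python
def classify_pair(judged, min_docs=13, chi2_critical=3.841):
    """Classify the relation between topics t1 and t2.

    judged: list of (rel1, rel2) booleans, one per document judged on both topics.
    Returns (relation, needs_manual_check).
    """
    a = sum(1 for r1, r2 in judged if r1 and r2)      # relevant to both
    b = sum(1 for r1, r2 in judged if r1 and not r2)  # relevant to t1 only
    c = sum(1 for r1, r2 in judged if r2 and not r1)  # relevant to t2 only
    d = len(judged) - a - b - c                        # relevant to neither
    needs_check = len(judged) < min_docs               # too few judgements
    if a == 0:
        return "mutually exclusive", needs_check
    if b == 0 and c == 0:
        return "equivalent", needs_check
    if b == 0:
        return "t1 subset of t2", needs_check
    if c == 0:
        return "t2 subset of t1", needs_check
    # Chi-square statistic for the 2x2 contingency table (a, b, c are all > 0 here).
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return ("dependent" if chi2 > chi2_critical else "independent"), needs_check
```

The first four relations are tested before the chi-square step, mirroring the order in the text, so a pair is never labeled dependent when one of the stricter relations holds in the sample.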


To  fully  explore  the  pairwise  relationships  between  topics, 
we  selected  ten  of  the  fifty  TREC-2  routing  topics  to  use  in 
our models and tests. The topics are listed in Table 3.1.

Topic   Description

57      Financial Health of MCI
61      Israeli Role in the Iran-Contra Affair
74      Conflicting Policies of the US Government
85      Official Corruption in any government
88      Crude Oil Price Trends
89      Downstream investing by OPEC members
90      Data on Proven Reserves of Oil and Gas
97      Fiber Optic Applications
98      Fiber Optic Equipment Manufacturers
99      Iran-Contra Affair

Table  3.1:  List  of  the  10  topics  we  selected 

The  ten  topics  were  selected  to  provide  an  interesting 
mixture  of  relationships  among  the  topics.  Before  looking 
at  the  relevance  information  in  the  training  set,  we  assessed 
these  relationships  manually,  as  shown  in  Table  3.2.  The 
table  is  symmetrical  about  the  main  diagonal. 


[Matrix of pairwise relationships among topics 57, 61, 74, 85,
88, 89, 90, 97, 98, and 99; the cell entries are not legible
in this copy.]

d = dependent
s = subset
sd = subset or dependent
i = independent
e = equivalent
empty = mutually exclusive


Table  3.2:  Pairwise  relationships  between  10  TREC  topics, 
assessed  manually  before  looking  at  the  training  data 

Table  3.3  shows  the  topic  relationships  that  were  generated 
automatically  from  the  relevance  judgements  on  the 
training  documents.  Manual  verification  of  the  system's 
assessment  was  required  in  about  half  of  the  cases  (marked 
with  an  asterisk).  The  table  is  symmetrical  about  the  main 
diagonal.

[Matrix of pairwise relationships among the ten topics; the
cell entries are not legible in this copy. Symbols as in
Table 3.2.]

* = manual intervention required

Table  3.3:  Pairwise  relationships  determined  from  the  data 

The  system's  assessments  match  the  manual  assessments 
quite  well.  The  system  found  even  more  mutually 
exclusive  pairs  than  we  had  intuitively  thought.  One 
surprise is the relationship between topic 61 ("Israeli Role
in the Iran-Contra Affair") and topic 99 ("Iran-Contra
Affair"). One might assume that 61 is a subset of 99, but
there are indeed documents in the training set that are
relevant to 61 but not to 99. Thus, the relation between 61
and 99 is dependence rather than subset.

The  information  in  Table  3.3  can  be  expressed  as  a  directed 
graph,  as  shown  in  Figure  3.1.  Each  topic  is  a  node.  Two 
topics  are  connected  by  an  arc  if  there  is  at  least  one 
document relevant to both. There is no arc between
mutually  exclusive  topics.  The  arcs  marked  "i"  connect 
independent  topics,  the  unmarked  arcs  connect  topics  that 
are  dependent,  and  the  directed  arc  points  to  a  subset  from 
its  superset. 

Figure  3.1  is  not  a  belief  network.  It  may  be  considered  a 
"co-occurrence  diagram,"  since  topics  that  are  relevant 
together  (that  co-occur)  in  the  collection  are  connected. 



Figure  3.1:  Relationships  diagram  among  10  TREC  topics 


3.2.2  Generating states from the topics

The  states  of  the  multiple-topic  query  can  be  generated 
automatically  by  enumerating  all  possible  "relevant  or  not 
relevant"  combinations  of  topics,  subject  to  two  rules: 

1.  Two  mutually  exclusive  topics  cannot  both  be 
relevant  within  the  same  state. 

2.  A  topic  that  is  a  subset  of  another  cannot  be 
relevant  within  a  state  unless  its  superset  topic 
is  also  relevant. 

By construction, there is always one state, U, to which a
document  is  relevant  if  it  is  relevant  to  none  of  the  topics. 

For  the  ten  topics,  these  rules  reduce  the  number  of  states 
drastically from the theoretical maximum of 2^10 = 1024
states  to  the  actual  number  of  28  states. 

A  state  is  identified  by  listing  the  topics  to  which  a 
document  is  relevant  (or  not  relevant)  if  it  is  relevant  to  that 
state.  For  instance,  documents  relevant  to  the  state 
(61  -74  85  -99)  are  relevant  to  topics  61  and  85  and  are  not 
relevant  to  topics  74  and  99.  The  list  of  states  appears  in 
the  first  column  of  Table  3.4. 
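
The enumeration under the two rules can be sketched as follows. The three topics "A", "B", and "C" and their relations are a hypothetical toy example, not the ten TREC topics; the function name and data layout are likewise assumptions.

```python
from itertools import product

def enumerate_states(topics, exclusive_pairs, subset_pairs):
    """List every allowed set of relevant topics.

    exclusive_pairs: (a, b) pairs that may not both be relevant (rule 1).
    subset_pairs: (sub, sup) pairs; sub may be relevant only if sup is (rule 2).
    The empty set plays the role of the "none of the topics" state U.
    """
    states = []
    for flags in product([False, True], repeat=len(topics)):
        relevant = {t for t, on in zip(topics, flags) if on}
        if any(a in relevant and b in relevant for a, b in exclusive_pairs):
            continue
        if any(sub in relevant and sup not in relevant for sub, sup in subset_pairs):
            continue
        states.append(frozenset(relevant))
    return states

# Toy example: A and B are mutually exclusive, C is a subset of A.
states = enumerate_states(["A", "B", "C"],
                          exclusive_pairs=[("A", "B")],
                          subset_pairs=[("C", "A")])
```

In this toy case the two rules cut the 2^3 = 8 combinations down to four: the empty state, (A), (B), and (A C).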

3.2.3  Generating  State  Priors 

In  theory,  the  prior  probabilities  of  the  states  can  be 
calculated  from  their  relative  frequencies  in  the  training  set. 
However,  since  there  are  very  few  documents  in  the  TREC- 
2  training  set  that  were  evaluated  for  three  or  more  topics,  a 
different  way  to  estimate  the  priors  was  required. 

One estimate is provided by factoring each state's prior into
a product of the priors of smaller compound topics. This
can  be  accomplished  without  manual  intervention.  For 
instance,  for  the  independent  topics  88,  89,  and  90,  only 
three  numbers  are  needed  (the  prior  probabilities  of  the 
three  topics  in  the  training  set)  to  compute  the  probabilities 
of  all  seven  states  containing  the  three  topics.  For  the  state 
(61  -74  85  -99),  two  numbers  are  needed:  the  prior 
probability  of  the  compound  topic  (-74  85  -99)  and  the 
probability  that  topic  61  is  relevant  given  that  topic  85  is 
relevant.  The  priors  obtained  by  this  method  are  shown  in 
the  second  column  of  Table  3.4.  They  are  expressed  as 
inverse  frequencies:  the  number  given  is  the  number  of 
weeks  (on  average)  between  articles  relevant  to  a  state, 
assuming  1000  documents  per  week. 
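
For the independent case, the factoring can be sketched as below. The three prior values are invented for illustration and are not taken from the training set; states are written as tuples of signed topic numbers, with a negative number meaning "not relevant."

```python
from itertools import product

def independent_state_priors(topic_priors):
    """Priors of all relevant/not-relevant combinations of independent topics.

    topic_priors: dict mapping topic number to its prior probability.
    The all-negative combination is skipped because it belongs to state U.
    """
    topics = list(topic_priors)
    priors = {}
    for flags in product([True, False], repeat=len(topics)):
        if not any(flags):
            continue
        p = 1.0
        for topic, on in zip(topics, flags):
            p *= topic_priors[topic] if on else 1.0 - topic_priors[topic]
        state = tuple(t if on else -t for t, on in zip(topics, flags))
        priors[state] = p
    return priors

# Hypothetical priors for topics 88, 89, and 90.
priors = independent_state_priors({88: 0.01, 89: 0.005, 90: 0.02})
```

Three numbers thus determine all seven state priors, which is the saving the text describes.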


                     Average weeks between articles
State                Assessed         Assessed
                     automatically    manually

(-88 -89 90)         < 1              3
(-88 89 -90)         < 1              3
(-88 89 90)          20               7
(88 -89 -90)         < 1              2
(88 -89 90)          20               4
(88 89 -90)          40               4
(88 89 90)           4000             9
(57 97 98)           10               8
(57 97 -98)          < 1              5
(57 -97 98)          < 1              6
(57 -97 -98)         < 1              3
(-57 97 98)          1                6
(-57 97 -98)         (illegible)      (illegible)
(-57 -97 98)         < 1              3
(61 74 85 99)        33               20
(-61 74 85 99)       14               3
(61 74 85 -99)       < 1              50
(-61 74 85 -99)      < 1              3
(61 -74 85 99)       < 1              20
(-61 -74 85 99)      < 1              5
(61 -74 85 -99)      5                100
(-61 -74 85 -99)     2                2
(74 -85 99)          < 1              3
(74 -85 -99)         < 1              2
(-74 -85 99)         < 1              3
(57 74)              < 1              20
(85 98)              1                50
U

Table  3.4:  The  states  and  their  prior  probabilities 
expressed  as  frequencies  (assuming  1000  articles  per 
week)  generated  by  two  assessment  methods 


However,  these  priors  proved  unsatisfactory  and  required 
manual  override,  for  three  reasons.  First,  the  training  set  is 
a  very  biased  sample  of  the  WSJ.  The  training  documents 
were  selected  by  the  retrieval  systems  of  TREC-1  to  be 
intentionally  relevant  to  at  least  one  of  the  topics.  Thus,  the 
prior  derived  from  the  training  set  tends  to  be  a  gross 
overestimate  of  the  true  prior. 

Second,  the  time  period  of  the  training  set  (1987  to  1992)  is 
different  from  that  of  the  test  set  (only  1991).  The 
frequency  of  the  topics  relative  to  each  other  changes  over 
time.  For  instance,  there  are  many  fewer  articles  on  the 
Iran-Contra  affair  (relative  to  the  other  topics)  in  1991  than 
there  were  in  1987-1990.  We  adjusted  the  priors  to  match 
the relative frequencies expected in the test set.

Third, the priors are widely divergent from the intuition of
any user who reads a newspaper. The numbers suggest, for
instance, that about 1 out of every 100 articles in the WSJ
(and, by analogy, in the SJMN) is relevant to topic 57,
"MCI."  Any  reader  of  the  WSJ  or  the  SJMN  knows  that 
this  estimate  is  much  too  high  -  the  relative  frequency  is 
probably  closer  to  1  in  500  or  1  in  1000  than  to  1  in  100. 

Thus,  the  prior  probabilities  were  assessed  manually.  It  is 
assumed  that  the  user  has  some  knowledge  of  the  test 
domain  (in  this  case,  articles  in  the  SJMN)  and  can  with 
some  thought  assess  the  relative  frequencies  of  various 
states  as  a  part  of  specifying  the  query  for  the  particular 
routing  request.  For  each  state,  we  ask  the  user  what  is  the 
average  number  of  weeks  between  the  publication  of 
articles  relevant  to  that  state.  This  number  is  presented  in 
the  third  column  of  Table  3.4.  It  can  be  converted  to  a  prior 
probability by combining it with the assumption that there
are 1000 documents per week.

The prior probability of U, p(U), is calculated as one
minus  the  sum  of  the  priors  of  all  the  other  states,  to  ensure 
that  the  probability  of  all  states  together  is  one. 
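
The conversion from "weeks between articles" to a prior, and the residual prior of U, can be sketched as follows. The function names are hypothetical; the two sample entries use the manually assessed values for states (57 74) and (85 98) from Table 3.4, and a real query would include all states.

```python
def prior_from_weeks(weeks, docs_per_week=1000):
    """One relevant article every `weeks` weeks, at `docs_per_week` articles per week."""
    return 1.0 / (weeks * docs_per_week)

def add_residual_state(state_weeks, docs_per_week=1000):
    """Convert each state's interval to a prior and assign the remainder to U."""
    priors = {s: prior_from_weeks(w, docs_per_week) for s, w in state_weeks.items()}
    priors["U"] = 1.0 - sum(priors.values())
    return priors

# Only two states are shown here for brevity.
priors = add_residual_state({(57, 74): 20, (85, 98): 50})
```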

3.2.4  Feature  Conditional  Probabilities 

The  inference  algorithm  requires,  for  each  feature  and  for 
each  state,  the  conditional  probability  of  the  feature  given 
the  state.  These  probabilities  cannot  be  obtained  directly 
from  the  relative  frequencies  obtained  from  the  training  set, 
because  there  are  few  documents  that  are  relevant  to  more 
than  one  topic  at  a  time. 

We  approximate  these  probabilities  by  using  a  structure 
called  a  noisy-or  gate. 

The  noisy-or  gate  combines  the  effects  of  two  or  more 
factors,  each  of  which  may  contribute  to  the  presence  of  a 
feature.  It  is  a  model  of  disjunctive  interaction,  as 
described  in  (Pearl,  1988).  It  has  been  used  in  medical 
decision research to calculate the probability of a particular
symptom  being  present,  given  diseases  that  cause  the 
symptom  (Heckerman,  1989). 

In  the  context  of  information  retrieval,  a  feature  may  be 
present  due  to  any  of  the  topics  that  are  relevant  in  the  state. 
For  each  state-feature  pair,  we  build  a  noisy-or  model.  The 
contributing  factors  are  the  topics  that  are  relevant  within 
the  state.  The  effect  is  the  feature's  presence  or  absence  in 
the  document. 

For  example,  consider  a  feature  f  and  the  state  (57  -97  98). 
The  feature  may  be  present  due  to  topics  57  or  98.  It 
cannot  be  present  due  to  topic  97  because  that  topic  is  not 
relevant within the state. Let E1 be the event that the
feature  is  present  due  to  topic  57,  and  let  E2  be  the  event 
that  the  feature  is  present  due  to  topic  98.  Table  3.5  lists  all 
the  possible  cases  of  the  two  uncertain  events.  Figure  3.4 
shows  the  belief  network  structure  of  the  noisy-or  model. 


The  node  with  the  double  wall  is  a  deterministic  logical  or 
gate. 


Feature present     Feature present     Feature present
due to 57 (E1)      due to 98 (E2)      at all (E1 OR E2)

Yes                 Yes                 Yes
Yes                 No                  Yes
No                  Yes                 Yes
No                  No                  No

Table  3.5:  Possible  cases  for  a  noisy-or  node 


Figure  3.4:  Belief  Network  corresponding  to  a  Noisy-Or 
Gate  Model 

The  only  case  in  Table  3.5  in  which  the  feature  is  absent  is 
the  fourth  case.  Thus,  the  conditional  probability  that  the 
feature  is  absent,  given  this  state,  is  the  probability  of  that 
fourth  case.  The  probability  that  the  feature  is  present  is 
one  minus  the  probability  that  the  feature  is  absent. 
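
A minimal sketch of the noisy-or computation follows. It assumes, as in the standard noisy-or model, that the contributing events act independently, so the probability of the "all absent" case is a product; the function name and the two sample probabilities are hypothetical.

```python
def noisy_or_present(cause_probs):
    """Probability that a feature is present in a state under a noisy-or model.

    cause_probs: for each topic relevant in the state, the probability that
    the feature is present due to that topic alone.
    Assumption: the causes act independently of one another.
    """
    p_absent = 1.0
    for p in cause_probs:
        p_absent *= 1.0 - p  # the feature is absent only if every cause fails
    return 1.0 - p_absent

# State (57 -97 98): only topics 57 and 98 can contribute (sample values).
p_present = noisy_or_present([0.3, 0.5])
```

With these sample values the feature is absent with probability 0.7 * 0.5 = 0.35, so it is present with probability 0.65. With an empty list the sketch returns 0, so a state with no relevant topics would need a separately estimated background probability.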

3.3  Document  Scoring 

The  Bayesian  inversion  described  in  Section  2.1  yields,  for 
each  document,  the  posterior  probability  that  the  document 
is  relevant  to  each  state.  We  calculate  the  posterior 
probability  that  the  document  is  relevant  to  each  topic  by 
summing  the  posterior  probabilities  of  all  of  the  states  in 
which  the  topic  appears. 

For  example,  the  posterior  probability  for  topic  57  is  the 
sum  of  the  posterior  probabilities  of  the  five  states  in  which 
topic  57  is  relevant  (refer  to  Table  3.4).  The  states  are  (57 
97  98),  (57  97  -98),  (57  -97  98),  (57  -97  -98  -74)  and 
(57  74). 

The  final  list  of  documents  for  each  topic  contains  the  top 
1000  documents,  ranked  in  descending  order  according  to 
the  posterior  probability  that  they  are  relevant  to  that  topic. 
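
The scoring step can be sketched as below. States are written as tuples of signed topic numbers (negative meaning "not relevant"), the empty tuple stands for U, and the posterior values are invented for illustration; all names are hypothetical.

```python
def topic_posterior(state_posteriors, topic):
    """Sum the posteriors of all states in which the topic is relevant."""
    return sum(p for state, p in state_posteriors.items() if topic in state)

def rank_documents(doc_state_posteriors, topic, limit=1000):
    """Return up to `limit` document ids in descending order of topic posterior."""
    scored = [(topic_posterior(posteriors, topic), doc_id)
              for doc_id, posteriors in doc_state_posteriors.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:limit]]

# Two hypothetical documents with posteriors over a handful of states.
docs = {
    "d1": {(57, 97, 98): 0.2, (57, 74): 0.1, (-57, 97, 98): 0.6, (): 0.1},
    "d2": {(57, -97, -98): 0.9, (): 0.1},
}
ranking = rank_documents(docs, 57)
```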

4  Experimental Environment

4.1  Hardware  and  Software  Environment 

The  system  was  developed  by  one  programmer  over  the 
period  of  six  months.  It  was  written  in  C  using  the 
Symantec  Think  C  5.0  compiler  under  Macintosh  System 
7.0.1.  The  experiments  were  run  on  two  machines, 
depending on availability: a Macintosh IIfx and a Centris
650.  Both  machines  had  8  Mb  of  memory  and  a  500  Mb 
hard  disk. 

4.2  Summary  of  System  Data  Structures 

Table 4.1 is a list of the data structures used by the system.
Each  is  listed  with  the  sections  of  this  paper  in  which  it  is 
described  and  with  an  indicator  of  whether  it  was  provided 
by  NIST  (N),  assessed  manually  by  us  (M),  or  generated 
automatically  by  the  system  (A). 

Data Structure (Sections, Source)

Descriptions of Routing Topics (3.1, N)
Training Documents (3.1, N)
Candidate Feature List (3.1, A/M)
Final Feature List (3.1, 4.3.2, A)
Relevance Judgements on the Training Documents (3.2.1, N)
Topic Relationships (3.2.1, A/M)
State Prior Probabilities (3.2.2, 4.3.2, M)
Feature Conditional Probabilities (3.2.4, A)
Test Documents (3.3, N)
Document Posterior Probabilities (3.3, A)
Final list of retrieved documents (3.3, A)

Table 4.1: Data Structures used by the System

The  candidate  feature  list  could  be  defined  entirely 
automatically  by  sufficiently  intelligent  procedures  such  as 
phrase  and  proper  name  identification.  The  state  prior 
probabilities  are  the  only  data  items  that  must  be  assessed 
manually. 

The  data  files  needed  by  the  system,  other  than  those 
provided  by  NIST,  occupy  about  100  Kb  on  disk. 

4.3  Training  Phase 

4.3.1  Building the Data Structures

We  are  a  Category  B  participant  in  TREC-2.  As  such,  we 
used  a  subset  (just  the  WSJ  articles)  of  the  full  training 
collection.  Of  these  we  used  only  those  (roughly)  29,000 
articles  for  which  there  is  some  relevance  judgment 
available.  In  addition,  because  we  considered  only  ten  of 
the  TREC-2  routing  topics,  we  used  only  the  5941 
documents  that  have  a  relevance  judgment  on  at  least  one 
of  the  ten  topics. 


The  inputs  to  a  training  run  are  the  candidate  feature  list, 
the  list  of  topic  relationships,  the  training  documents,  and 
the  training  relevance  judgements.  The  outputs  of  a 
training  run  are  the  final  list  of  features  and  the  feature 
conditional  probabilities. 

The time to complete a training run is about 1.5 hours.

4.3.2  Defining the Query

The  design  of  the  TREC-2  experiment  required  that  before 
we  receive  the  test  data,  we  submit  the  definitions  of  our 
system  and  of  our  particular  query  to  NIST.  The  query 
definition  includes  the  final  feature  list,  the  list  of  topic 
relationships,  and  the  state  prior  probabilities.  All  of  the 
other  data  is  determined  automatically,  as  shown  in 
Table  4.1. 

We  ran  about  two  dozen  sample  experiments  that  applied 
the  system  to  the  training  documents  to  gauge  its 
performance  with  different  query  definitions.  In  these  tests, 
we  varied  only  two  of  the  data  inputs,  the  state  prior 
probabilities  and  the  final  feature  list.  The  list  of  topic 
relationships  remained  constant  throughout  these  tests. 

It  took  much  longer  than  expected  to  finish  programming 
the  system.  Thus,  all  of  the  experiments  to  define  the  query 
were  completed  in  the  two  weeks  immediately  preceding 
the  final  submission  of  our  query  definition  to  NIST. 

4.4  Test  Phase 

As  a  Category  B  participant  in  TREC-2,  we  ran  our  routing 
experiment  on  just  the  SJMN  articles. 

The  inputs  to  the  test  runs  were  the  final  list  of  features,  the 
feature  conditional  probabilities,  the  list  of  topic 
relationships,  and  the  test  documents.  The  output  of  the  test 
run  is  the  final  list  of  retrieved  documents. 

A  test  run  takes  about  5.5  hours.  This  includes 
decompressing  the  150  MB  of  test  data,  one  file  at  a  time, 
when  it  is  needed. 


5  Results 

The  TREC-2  designation  for  our  system  is  "idsra2."  Figure 
5.1  shows  the  precision-recall  curves  for  the  ten  topics  we 
considered,  excluding  topic  88,  for  which  no  TREC-2 
participant  found  any  relevant  documents. 

Table  5.1  shows  several  measures  of  performance  for  our 
results,  along  with  the  best,  median,  and  worst  values  for 
those  measures  among  all  TREC-2  participants.  We  again 
exclude topic 88.

Figure 5.1: Precision vs. Recall for system idsra2


Topic      Relevant Retrieved at 100
Number     idsra2    Best    Median    Worst

57         17        18      17        0
61         19        19      9         0
74         6         16      6         0
85         33        54      33        1
89         2         3       2         0
90         1         1       0         0
97         25        28      18        1
98         24        24      17        0
99         60        60      52        0

Table  5.1a:  Relevant  documents  in  the  top  100  retrieved, 
for  idsra2  and  for  all  systems 


Topic      Relevant Retrieved at 1000
Number     idsra2    Best    Median    Worst

57         18        19      18        0
61         24        25      24        0
74         11        31      11        1
85         88        115     88        2
89         2         4       2         0
90         1         3       0         0
97         27        32      27        1
98         26        29      26        0
99         70        70      66        0

Table  5.1b:  Relevant  documents  in  the  top  1000  retrieved, 
for  idsra2  and  for  all  systems 


Topic      Average Precision
Number     idsra2         Best           Median         Worst

57         0.387          0.460          0.374          0.000
61         0.464          0.464          0.083          0.000
74         (illegible)    (illegible)    (illegible)    0.000
85         0.174          0.353          0.174          0.000
89         0.081          0.259          0.077          0.000
90         0.025          0.025          0.000          0.000
97         0.383          0.383          0.202          0.002
98         0.282          0.427          0.334          0.000
99         0.700          0.700          0.509          0.000

Table  5.1c:  Average  precision  (as  defined  for  TREC-2), 
for  idsra2  and  for  all  systems 


6        Conclusions  and  Future  Directions 

We believe that we have made significant progress toward
developing an information retrieval architecture that:

•  is oriented towards assisting users with stable
   information needs in routing large amounts of
   time-sensitive material

•  gives users an intuitive language with which to specify
   their information needs

•  requires modest computational resources, and

•  can integrate relevance feedback and training data with
   users' judgements to incrementally improve retrieval
   performance.

We  are  encouraged  by  the  test  results.  We  have  not  had 
very  much  time  to  analyze  the  results  but  we  intend  to  try  to 
understand  why  we  did  very  well  on  some  topics  and  not  so 
well  on  others.  Very  preliminary  analysis  suggests  that  the 
features  for  the  topics  in  which  we  did  well  (e.g.,  61  and 
99)  were  much  more  informative  than  the  ones  on  which 
we  did  very  poorly  (e.g.,  74). 

We  have  many  ideas  for  future  research.  These  ideas  fall 
into  three  basic  categories:  probabilistic  representation, 
user  interface,  and  inference  methods. 

The  most  important  improvements  we  would  like  to  make 
are  in  the  category  of  probabilistic  representation  of  the 
topic  and  the  document.  One  research  goal  is  to  develop  a 
way  to  intuitively  represent  relationships  between  features. 
Also,  we  would  like  to  explore  more  sophisticated  feature 
extractors  that  recognize  phrases,  synonyms,  and  features 
derived  from  natural  language  processing.  We  believe  that 
achievement  of  these  goals  could  lead  to  significant 
improvements in performance.

In conjunction with these representational improvements,
we  would  like  to  design  a  complete  user  interface  that 
would  allow  users  to  make  the  needed  probabilistic 
judgements  easily  and  intuitively.  We  would  also  like  to 
develop  an  explanation  facility  that  could  describe  why  one 
document  was  preferred  over  another. 

There  are  new  exact  and  inexact  algorithms  that  could 
handle  these  representational  modifications.  We  would  like 
to  experiment  with  these  algorithms  to  see  if  they  are 
suitable.  We  would  also  like  to  implement  a  relevance 
feedback  mechanism  based  on  the  Bayesian  concept  of 
equivalent  sample  size. 

Finally,  while  the  information  retrieval  problem  has  been 
viewed  primarily  (and  rightly  so)  as  an  evidential  reasoning 
problem,  we  take  the  position  that  a  decision-theoretic 
perspective  is  more  accurate  since  information  retrieval  is  a 
decision  process.  We  believe  that  this  perspective  can 
provide  additional  insight  and  eventually  improve  upon  the 
probabilistic approaches that have been developed.

Appendix: Bayesian Inference

We  calculate  the  posterior  probability  that  a  document  is 
relevant  to  a  state  using  Bayes'  rule  and  an  assumption  of 
conditional  independence  between  the  features. 

The Bayesian inversion formula is this:

$$p(s \mid \text{document}) = p(s \mid F) = \frac{p(F \mid s)\, p(s)}{p(F)},$$

where $s$ is the state, and $F$ is a binary feature vector with

$F_i = 1$ if feature $i$ is present (the event $f_i^+$)

$F_i = 0$ if feature $i$ is absent (the event $f_i^-$).

The set of all features that are present is denoted $F^+$. All
other features are absent and are in the set denoted $F^-$.

The first term in the numerator is expanded under the
assumption that the features are conditionally independent
given the state.

$$p(F \mid s) = \prod_{i \in F^+} p(f_i^+ \mid s) \prod_{i \in F^-} p(f_i^- \mid s)$$

The second term in the numerator is the state prior, $p(s)$,
which can be estimated as described in Section 3.2.

The denominator is a normalization constant obtained by
summing over the values for the numerator for all the
states.

$$p(F) = \sum_{\text{all } s} p(F \mid s)\, p(s)$$

$p(F)$ is the probability that one would observe the
particular set of features $F$.
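
The inversion can be sketched as follows. Working in log space is a choice made here for numerical stability with a few hundred features, not something the text specifies; the function name and data layout are also assumptions, and the conditional probabilities are assumed to lie strictly between 0 and 1.

```python
import math

def state_posteriors(feature_vector, priors, cond_probs):
    """Posterior p(s | F) for every state s.

    feature_vector: list of 0/1 values, one per feature.
    priors: dict mapping state to p(s).
    cond_probs: dict mapping state to a list of p(feature i present | s).
    """
    log_joint = {}
    for state, prior in priors.items():
        total = math.log(prior)
        for present, p in zip(feature_vector, cond_probs[state]):
            # Present features contribute p, absent features contribute 1 - p.
            total += math.log(p if present else 1.0 - p)
        log_joint[state] = total
    # Normalize: subtract the maximum before exponentiating to avoid underflow.
    top = max(log_joint.values())
    unnormalized = {s: math.exp(v - top) for s, v in log_joint.items()}
    z = sum(unnormalized.values())
    return {s: v / z for s, v in unnormalized.items()}

# Two hypothetical states and two features; feature 0 present, feature 1 absent.
posteriors = state_posteriors(
    [1, 0],
    priors={"a": 0.5, "b": 0.5},
    cond_probs={"a": [0.9, 0.1], "b": [0.1, 0.9]},
)
```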

Abstract

A  new  retrieval  method  together  with  a  new  access 
structure  is  presented  that  is  aimed  at  a  high  update 
efficiency,  a  high  retrieval  efficiency  and  a  high  retrieval 
effectiveness.  The  access  structure  consists  of  signatures 
and  non-inverted  descriptions.  This  access  structure  can 
be  updated  efficiently  because  the  description  of  a  single 
document  is  stored  in  a  compact  form.  The  signatures 
are  used  to  compute  approximate  retrieval  status  values 
first,  and  the  non-inverted  descriptions  are  then  used  to 
determine the final list of documents ranked by the exact
retrieval status values. Our basic approach based on the
standard tf * idf weighting scheme has been improved in
both retrieval effectiveness and retrieval efficiency. On
average, the time for retrieving the top ranked document is
clearly below two seconds while the document collection
can be updated in 10 msec (inserting, deleting, or
modifying a document description).

1  Introduction 

Information retrieval systems are increasingly used to
retrieve information from large and dynamic data
collections, e.g. in office automation or in newscasting
companies. Since in such environments the data collections
may have to be updated many times every second, the update
efficiency is an important evaluation criterion that should
be taken into account in addition to the traditional
evaluation criteria (retrieval effectiveness, retrieval
efficiency, etc.) [10, pp. 161], [12, pp. 144].

Before introducing our approach, we show that update
efficiency conflicts with retrieval effectiveness and with
retrieval efficiency. In the case of static data collections,
the retrieval effectiveness criterion is usually met by means
of weighted retrieval including relevance feedback [4], [9]
whereas the retrieval efficiency criterion is met by
precomputing the document descriptions and by organizing
these descriptions as an inverted file [5]. In the case of
dynamic data collections, however, a transaction updating the
inverted file may block other retrieval and update
transactions for an unacceptably long time because adding the
description of a new document to an inverted file is time
consuming, particularly if the document is long. A possible
remedy is to use signatures instead of an inverted file [2].
Signatures can be updated efficiently; however, they do not
allow document feature weighting and hence, the retrieval
effectiveness of the signature based method is inferior to a
fully weighted retrieval method [8]. Thus, it is difficult to
achieve simultaneously high retrieval effectiveness, high
retrieval efficiency, and high update efficiency.

In our approach, we focus on the retrieval from large and
dynamic data collections where we face the problem of
achieving simultaneously high retrieval effectiveness, high
retrieval efficiency, and high update efficiency. Addressing
this problem is justified by the need for appropriate
retrieval capabilities in dynamic environments such as in
office automation or in newscasting companies. Within the
SPIDER project [11] we have developed a retrieval method and
an access structure which supports fully weighted retrieval
and which achieves short response times even when the
collection is updated several times every second. In
Section 2, we describe our basic approach, in Section 3, we
focus on refinements of this basic approach, in Section 4, we
present the results, and in Section 5, we draw some
conclusions.

2  SPIDER's Query Evaluation Algorithm

In this section, we present a retrieval method together with
a new access structure facilitating both fast weighted
retrieval and efficient updates of the access structure. A
preliminary version was presented in [11]. Our access
structure consists of signatures and non-inverted
descriptions of the documents. The signatures are used to
compute approximate retrieval status values $RSV^0(q, d_j)$
first and the non-inverted document descriptions are then
used to determine the exact retrieval status values
$RSV(q, d_j)$. The trick is that the approximate retrieval
status values provide fairly tight upper bounds for the exact
retrieval status values. We will see how we can take
advantage of these upper bounds such that only a few exact
retrieval status values have to be computed.

The retrieval method determining the approximate retrieval status
values and the retrieval method determining the exact retrieval
status values are given by the functions $RSV^0$ and $RSV$
respectively. Let

\[ \Phi := \{\varphi_0, \ldots, \varphi_{m-1}\} \]

be the indexing vocabulary (e.g. a set of terms) and let

\[ D := \{d_0, \ldots, d_{n-1}\} \]

be the current document collection. We assume that the signatures
$\sigma(d_j)$ consist of $w$ bits where the bit at position $p$ has
the value $\sigma(d_j)[p]$.

\[ \sigma(d_j) = (\sigma(d_j)[0], \ldots, \sigma(d_j)[w-1]) \]

Every indexing feature $\varphi_i$ is assigned a bit position
$p = h(\varphi_i)$ by means of a hash function $h$.

\[ h : \Phi \to \{0, \ldots, w-1\} \]

The function $h$ specifies a signature $\sigma(d_j)$ for every
document $d_j$ by setting the bit at position $p$ iff $d_j$
contains a feature $\varphi_i$ which is hashed to this position.

\[ \sigma(d_j)[p] := \begin{cases} 1 & \text{if } \exists\, \varphi_i \in d_j : h(\varphi_i) = p \\ 0 & \text{otherwise} \end{cases} \]
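
As a minimal sketch of this signature definition, the following Python fragment builds $\sigma(d_j)$ as an integer bit mask. The width `W = 64` and the CRC32-based hash `h` are illustrative assumptions; the text does not state which hash function or signature width SPIDER uses.

```python
import zlib

W = 64  # signature width w in bits (illustrative assumption)


def h(feature: str, w: int = W) -> int:
    """Hypothetical hash function h: maps a feature to a bit position in 0..w-1."""
    return zlib.crc32(feature.encode("utf-8")) % w


def signature(features, w: int = W) -> int:
    """Return sigma(d_j) as a bit mask: bit p is set iff some feature hashes to p."""
    sig = 0
    for feature in features:
        sig |= 1 << h(feature, w)
    return sig
```

Because each feature only sets one bit, the signature of a document is the bitwise OR of the signatures of its features, which is what makes adding a document cheap.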

We now define the approximate retrieval status value
$RSV^0(q,d_j)$ and the exact retrieval status value $RSV(q,d_j)$
as follows.

\[ RSV^0(q,d_j) := \frac{1}{|d_j|} * \sum_{\varphi_i \in q,\ \sigma(d_j)[h(\varphi_i)] = 1} a^0_{ij} * b_i \tag{1} \]

\[ RSV(q,d_j) := \frac{1}{|d_j|} * \sum_{\varphi_i \in q \cap d_j} a_{ij} * b_i \tag{2} \]

where $a^0_{ij}$ denote the approximate weights of document
features, $a_{ij}$ denote the exact document weights, and $b_i$
denote the query weights. The size $|d_j|$ of the description
vector $d_j$ is defined as usual (see Table 1).

Our basic approach consists of a tf * idf weighting scheme, i.e.
the query features are "ntn" weighted whereas the document
features are "ntc" weighted (see Table 1).

The query weights $b_i$ depend on the feature frequencies
$ff(\varphi_i,q)$ and on the normalized inverse document
frequencies $nidf(\varphi_i)$. The exact document weights $a_{ij}$
depend on the feature frequencies $ff(\varphi_i,d_j)$ and on the
$nidf(\varphi_i)$. In addition, they ($a_{ij}$) are cosine
normalized, which is expressed by the division by $|d_j|$ in (1)
and (2). The $nidf(\varphi_i)$'s are determined by the document
frequencies $df(\varphi_i)$. The normalizations of the feature
frequency component and the normalizations of the document
frequency component do not affect the retrieval status values
because they cancel out when using the cosine measure.

Figure 1: Computation of the exact retrieval status values in
decreasing order of the approximate retrieval status values.

Because of these normalizations, the approximate document weights
$a^0_{ij}$ are always upper bounds of the exact document weights
$a_{ij}$, but they do not depend on the feature frequencies
$ff(\varphi_i,d_j)$.

\[ 0 \le a_{ij} \le a^0_{ij}, \qquad 0 \le b_i \]

It is easy to show that $RSV(q,d_j) = 0$ if
$\sigma(q) \wedge \sigma(d_j) = 0$. Furthermore, from the fact that
$\varphi_i \in d_j$ implies $\sigma(d_j)[h(\varphi_i)] = 1$ and
from $a_{ij} \le a^0_{ij}$ follows

\[ RSV(q,d_j) \le RSV^0(q,d_j) \tag{3} \]
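
The chain of inequalities behind (3) can be written out as follows. This is a sketch which assumes that the sum in (1) ranges over all query features $\varphi_i \in q$ whose bit $\sigma(d_j)[h(\varphi_i)]$ is set, and that the sum in (2) ranges over the features shared by $q$ and $d_j$; these summation ranges are an assumption inferred from the surrounding text.

```latex
\begin{align*}
RSV(q,d_j)
  &= \frac{1}{|d_j|} \sum_{\varphi_i \in q \cap d_j} a_{ij} \, b_i \\
  &\le \frac{1}{|d_j|} \sum_{\varphi_i \in q \cap d_j} a^0_{ij} \, b_i
     && \text{since } a_{ij} \le a^0_{ij} \text{ and } b_i \ge 0 \\
  &\le \frac{1}{|d_j|} \sum_{\varphi_i \in q,\ \sigma(d_j)[h(\varphi_i)] = 1} a^0_{ij} \, b_i
     && \text{since } \varphi_i \in d_j \Rightarrow \sigma(d_j)[h(\varphi_i)] = 1 \\
  &= RSV^0(q,d_j)
\end{align*}
```

The second step only adds non-negative terms, namely query features that are not in $d_j$ but collide with one of its bit positions. These false matches are the reason why $RSV^0$ is an upper bound rather than the exact value.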


In what follows, we describe the evaluation algorithm by means of
an example. Let $q$ be the user's query. In a first step, bitwise
AND-operations are performed with the query signature $\sigma(q)$
and the signature $\sigma(d_j)$ of every document $d_j \in D$. If
$\sigma(q) \wedge \sigma(d_j)$ is non-zero then the approximate
retrieval status value $RSV^0(q,d_j)$ is computed. The first step
is finished by ordering the documents in decreasing order of the
approximate retrieval status values. Remember that
$\sigma(q) \wedge \sigma(d_j) = 0$ implies that $RSV(q,d_j) = 0$
and hence, $d_j$ can be ignored in the sequel.

In the second step, the exact retrieval status values are computed
in the order that was previously determined by means of the
approximate retrieval status values. As shown in the following
example, only as many exact retrieval status values are computed
as necessary. Assume that $d_7$ is the first (i.e. $d_7$ has the
highest $RSV^0$), $d_{13}$ is the second, and $d_5$ is the third
document as shown in Figure 1. In this case, $RSV(q,d_7)$ is
computed first and next, $RSV(q,d_{13})$ is computed. According to
Figure 1, $RSV(q,d_{13})$ is the highest exact retrieval status
value of those two exact retrieval status values that are known at
this moment. Furthermore, all unknown exact retrieval status
values are less than or equal to $RSV^0(q,d_5)$ because of (3) and
because the documents are ordered by decreasing approximate
values. Since $RSV(q,d_{13}) > RSV^0(q,d_5)$, we can conclude that
$d_{13}$ has the highest exact retrieval status value of all
documents. In this example, only two exact retrieval status values
had to be computed to determine the top ranked document. In
practice, more exact retrieval status values have to be computed.
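
The two-step evaluation can be sketched in Python as follows. This is an illustration under stated assumptions, not the SPIDER implementation: signatures are modelled as integer bit masks, and `approx_rsv` and `exact_rsv` are hypothetical callbacks standing in for (1) and (2).

```python
def top_document(query_sig, signatures, approx_rsv, exact_rsv):
    """Return (doc_id, exact RSV) of the top ranked document, or None.

    signatures: mapping doc_id -> signature bit mask
    approx_rsv(doc_id): upper bound RSV^0(q, d_j), assumed cheap
    exact_rsv(doc_id): exact RSV(q, d_j), assumed expensive
    """
    # Step 1: scan the signatures, keep documents whose signature overlaps
    # the query signature, and sort them by decreasing approximate value.
    ranked = sorted(
        ((approx_rsv(d), d) for d, sig in signatures.items() if query_sig & sig),
        key=lambda pair: pair[0],
        reverse=True,
    )
    # Step 2: compute exact values in that order and stop as soon as the
    # best exact value reaches the next upper bound.
    best = None
    for upper, d in ranked:
        if best is not None and best[1] >= upper:
            break  # no remaining document can exceed the current best
        value = exact_rsv(d)
        if best is None or value > best[1]:
            best = (d, value)
    return best
```

With made-up values in which the first two candidates have exact values 0.4 and 0.6 and the third has an approximate value of 0.5, the loop stops after two exact computations, mirroring the example in the text.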

The evaluation algorithm is based on an access structure that
specifies the following functions.

\[ \sigma : d_j \mapsto \sigma(d_j) \tag{4} \]

\[ \pi : d_j \mapsto \{(\varphi_i, ff(\varphi_i,d_j)) : \varphi_i \in d_j\} \tag{5} \]

\[ nidf : \varphi_i \mapsto nidf(\varphi_i) \tag{6} \]

\[ size : d_j \mapsto |d_j| \tag{7} \]

The functions $\sigma$ and $\pi$ determine the signatures and the
non-inverted document descriptions respectively. The functions
$nidf$ and $size$ provide scaling and normalization factors. The
function values $\sigma(d_j)$ and $\pi(d_j)$ are determined
completely by the document $d_j$ itself. The function values
$nidf(\varphi_i)$ and $size(d_j)$, however, depend on the entire
document collection. Fortunately, the variation of
$nidf(\varphi_i)$ is small if the data collection is sufficiently
large. Thus, $nidf(\varphi_i)$ has to be recomputed only if the
domain of the data collection is shifting, and $size(d_j)$ has to
be recomputed if either $nidf$ was recalculated or $d_j$ was
modified. In the next section, we will see how this basic query
evaluation scheme can be refined.
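
The four functions can be pictured as four lookup tables. The sketch below is a hypothetical in-memory layout, not the SPIDER storage format; the class name, the CRC32 hash, the width of 64 bits, the default nidf of 1.0 for unknown features, and the "ntc"-style length are all assumptions made for illustration.

```python
import math
import zlib
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class AccessStructure:
    w: int = 64                                 # signature width (assumed)
    sigma: dict = field(default_factory=dict)   # d_j -> signature, function (4)
    pi: dict = field(default_factory=dict)      # d_j -> {feature: ff}, function (5)
    nidf: dict = field(default_factory=dict)    # feature -> nidf, function (6)
    size: dict = field(default_factory=dict)    # d_j -> |d_j|, function (7)

    def _h(self, feature):
        return zlib.crc32(feature.encode("utf-8")) % self.w

    def insert(self, doc_id, tokens):
        """Add or replace one document; only this document's entries change."""
        ff = Counter(tokens)
        sig = 0
        for feature in ff:
            sig |= 1 << self._h(feature)
        self.sigma[doc_id] = sig
        self.pi[doc_id] = dict(ff)
        # "ntc"-style length: a_ij = ff / max ff * nidf (nidf defaults to 1.0)
        max_ff = max(ff.values(), default=1)
        self.size[doc_id] = math.sqrt(
            sum((n / max_ff * self.nidf.get(f, 1.0)) ** 2 for f, n in ff.items())
        )

    def delete(self, doc_id):
        for table in (self.sigma, self.pi, self.size):
            table.pop(doc_id, None)
```

The point of the layout is that `insert` and `delete` touch one entry per table, whereas an inverted file would have to touch one posting list per feature of the document.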

3  Refinements 

The basic evaluation algorithm described in Section 2 achieves an
excellent update efficiency because the description of a single
document is stored in a compact form which can be updated
efficiently. However, the retrieval effectiveness as well as the
retrieval efficiency need further refinements in the case of the
TREC experiment, where extremely large queries are matched against
a large document collection containing some large documents.

To improve the retrieval effectiveness, we adapted the basic
evaluation algorithm to the best global weighting scheme achieved
by the SMART system at TREC-1 [1]. The query features are "ltn"
weighted, i.e. the query weights depend on the logarithm of the
feature frequencies and on the inverse document frequencies, and
they are not cosine normalized. The "lnc" weighted document
features depend on the logarithm of the feature frequencies. They
are cosine normalized, but they do not have a document frequency
component. The detailed definitions of the feature weights are
given in Table 1.

In order to achieve a good retrieval efficiency for the TREC
collection, we have reduced the indexing vocabulary in such a way
that the retrieval efficiency is improved while the retrieval
effectiveness does not deteriorate very much. Our indexing
vocabulary was reduced not only by applying a stop word list and a
word reduction algorithm, but also by eliminating further indexing
terms. From another project [3] and from the experiences of the
SMART system at TREC-1 [1] we knew that the removal of the indexing
features having a very high document frequency does not
deteriorate the retrieval effectiveness very much. The reason why
this restriction has a small effect on the retrieval effectiveness
is the following. The elimination of the high document frequency
features changes the retrieval status values very little because
these features have a small inverse document frequency and hence
contribute little to the retrieval status values. Such an
elimination of the high document frequency features is similar to
the application of a collection dependent stop word list in
addition to a general stop word list. Thus, this restriction has
only a small effect on the retrieval status values.

In what follows, we focus on the retrieval efficiency and why it
is improved by this restriction. With a reduced indexing
vocabulary,

1. fewer documents have a positive approximate retrieval status
value,

2. fewer features $\varphi_i$ contribute an error
$(a^0_{ij} - a_{ij}) * b_i$ to the approximate retrieval status
values, and

3. fewer false matches (caused by collisions) occur when computing
the approximate retrieval status values.

Both scanning the signatures and sorting the approximate retrieval
status values become faster when fewer approximate retrieval
status values are positive. Furthermore, the approximate retrieval
status values become tighter upper bounds when fewer false matches
occur and when the sum of the errors $(a^0_{ij} - a_{ij}) * b_i$
is reduced. Having tighter upper bounds means that fewer exact
retrieval status values have to be computed until the top ranked
document is determined. The retrieval effectiveness and the
retrieval efficiency achieved by means of the reduced indexing
vocabulary are presented in the next section.
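
A vocabulary restriction of this kind can be sketched as a document frequency cut-off. The function below is a hypothetical helper, assuming that the document descriptions are available as a mapping from document identifiers to feature frequencies; the default of 0.15 corresponds to the 15% limit used in the experiments.

```python
from collections import Counter


def restrict_vocabulary(descriptions, max_df_fraction=0.15):
    """Drop features whose document frequency exceeds a fraction of the collection.

    descriptions: mapping doc_id -> {feature: feature frequency}
    Returns the reduced descriptions and the set of removed features.
    """
    n = len(descriptions)
    # df(feature) = number of documents that contain the feature
    df = Counter(f for desc in descriptions.values() for f in desc)
    removed = {f for f, count in df.items() if count > max_df_fraction * n}
    reduced = {
        d: {f: ff for f, ff in desc.items() if f not in removed}
        for d, desc in descriptions.items()
    }
    return reduced, removed
```

The removed set plays the role of the collection dependent stop word list mentioned above.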

4 Experiments

In this section, we present the evaluation of the method described
above and we compare it to methods using other weighting schemes.
We focus on the efficiency of modifying documents and on the
correlation between the retrieval efficiency and the retrieval
effectiveness. We will also see how the vocabulary restriction
influences the retrieval effectiveness and the retrieval
efficiency. For the final evaluations we concentrated on the ad hoc
queries.

Before discussing the results, we define what we mean by a
partition, a run and an experiment. The document collection has
been split up into several partitions, each consisting of at most
100,000 documents. Thus, the large collections DOE1 (Department of
Energy, Disk 1) and ZIFF3 (Ziff-Davis Publishing, Disk 3) were
divided into three and two partitions respectively. A run consists
of the evaluation of 50 queries (either the set of routing queries
or the set of ad hoc queries) against all documents of one
partition. For each query, the 1000 top ranked documents have been
retrieved. An experiment consists of several runs and the merging
of the lists of ranked documents for each query. For TREC-2, the
two sets of experiments "Topics 51-100 versus Disk 3" and "Topics
101-150 versus Disks 1 and 2" have been evaluated.
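
Merging the ranked lists of several runs for one query can be sketched as a k-way merge. The function below is a hypothetical illustration; it assumes that the retrieval status values are comparable across partitions, which the text does not spell out.

```python
import heapq
from itertools import islice


def merge_runs(runs, k=1000):
    """Merge per-partition ranked lists for one query into a single top-k list.

    runs: iterable of lists of (rsv, doc_id), each sorted by decreasing rsv
    """
    merged = heapq.merge(*runs, key=lambda pair: pair[0], reverse=True)
    return list(islice(merged, k))
```

Since each run already delivers its documents in ranked order, the merge only has to look at the head of each list and never re-sorts a partition.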

All  efficiency  evaluations  are  based  on  CPU  time 
rather  than  on  real  time  in  order  to  eliminate  side  effects 
from  other  jobs  running  on  the  same  machine.  In  these 
experiments,  we  used  a  SUN  SPARCserver  MP690  with 
128  MBytes  RAM. 

We derived the document descriptions directly from the CDs. The
indexing process included the elimination of stop words (van
Rijsbergen's stop list [12, p. 18]) and Porter's word reduction
algorithm [6]. The normalized inverse document frequencies have
been derived from the documents of disks 1 and 2 only.
Uncompressing and indexing a single document takes around 100 msec
on average, depending on the length of the document. The
computation of the inverse document frequencies from the
descriptions took about 1.5 hours of CPU time.

The average time for inserting a document description into the
access structure is on the order of 10 msec, again depending on
the number of features $\varphi_i$ per document. Inserting a
document description into an inverted file would need more time
because the postings would have to be inserted into the different
lists associated with each feature.

The restriction of the vocabulary was accomplished by omitting
features occurring in more than 15% of all documents (from disks 1
and 2), i.e. in more than 111,337 documents. We have chosen 15% of
the collection although a stronger limit of 10% should not affect
the retrieval effectiveness either [1]. In our experiments we
compare the 15% limit ("df15") to a non-restricted vocabulary
("all").

We now have three parameters which can be combined to specify
eight different retrieval methods. Each method can be identified
by a string built from the labels for the document feature
weighting, the query feature weighting and the vocabulary
restriction:

(doc_feat_weight).(query_feat_weight).(vocab)

In what follows, we present the results of the following nine
methods, i.e. the basic method M0 and the eight combinations M1 to
M8:

M0 ntc.ntn.all

M1 lnc.ltn.all
M2 lnc.ltn.df15
M3 lnc.ntn.all
M4 lnc.ntn.df15

M5 ltc.ltn.all
M6 ltc.ltn.df15
M7 ltc.ntn.all
M8 ltc.ntn.df15

First, we compare the retrieval effectiveness of our method (M1)
described in Section 3 to the standard tf * idf method (M0) by
means of the precision-recall graph in Figure 2. As expected, the
method M1 is more effective than M0 and achieves a retrieval
effectiveness among the best methods presented at TREC-2. In order
to find out the reason for this difference in the retrieval
effectiveness, we must have a closer look at the influence of each
parameter (document and query feature weighting).

In Figure 3, the 11-pts average precisions of each method (M0 to
M8) are plotted on the left axis, and they are connected to the
median response times (for the top ranked document) plotted on the
right axis. The most obvious conclusion from this graph is the
following: the higher the precision, the slower the response, and
vice versa. The method M0 performs clearly worse than the methods
M1 to M8 with respect to both retrieval effectiveness and
retrieval efficiency.

We concentrate on the response times of the top ranked document.
The response times of all further ranked documents are of
secondary interest, since a user is supposed to read the top ranked
document before looking at the other documents, and the retrieval
system can retrieve further documents while the user is reading
the top ranked document.

Figure 2: Precision recall graphs of the most effective method
(M1) and of the least effective method (M0).

We can also see from Figure 3 how the different parameters
influence the average precision. Regarding the weighting of the
document features, the "lnc" weighting achieves a 4-10% higher
precision than the "ltc" weighting. The "ntc" weighting loses 5%
of precision compared to "ltc". In the case of query feature
weighting, again the logarithmic "ltn" weighting is more effective
than the "ntn" weighting (10-15%). Restricting the vocabulary
results in a 2-7% lower precision. We can summarize the
experiences as follows:

• The $nidf(\varphi_i)$'s in the document feature weights have a
bad influence on the retrieval effectiveness. It could be that for
long documents the estimation of the lengths $|d_j|$ is
inappropriate when the $nidf(\varphi_i)$'s are taken into account.

• Logarithmic feature weighting is more appropriate for the long
TREC documents and queries than linear feature weighting.
Logarithmic feature weighting avoids an overweighting of features
occurring very frequently within a document.

• Restricting the indexing vocabulary by omitting features with a
high document frequency $df(\varphi_i)$ has a noticeable influence
on the average precision.

In what follows, we discuss the influences of the different
parameters on the response time (as shown in Figure 3).
Restricting the vocabulary accelerates the query evaluation (by
9-14%) for the reasons described in Section 3. The "ntn" weighting
of the query features is also 9-14% faster than the "ltn"
weighting. For document feature weighting, the "lnc" weighting is
5-10% slower than the "ltc" weighting. The "ntc" weighting is even
slower. These results can be explained in terms of the
approximation error.

• It is obvious that for the "ltc" weighting the approximation
error $(a^0_{ij} - a_{ij}) * b_i$ is smaller than for the "lnc"
weighting because of the scaling with the $nidf(\varphi_i)$. On
the other hand, the approximation error for the "ntc" weighting is
very large compared to both the "ltc" and the "lnc" weighting
because of the normalized linear weighting instead of the
normalized logarithmic weighting.

Figure 3: Average precisions and response times of the first
ranked document.

Figure 4: Response times of the first ranked document per run (ad
hoc queries versus the listed partitions) for the fastest method
(M8, ltc.ntn.df15).

The box plots [7, p. 336] presented in Figure 4 show the
distribution of the response times required to determine the top
ranked document. A box plot visualizes the median (the line within
the box), half of all samples (the box), and outliers (the dots)
of a sample collection. In our case we have 50 samples: the
response times of the 50 ad hoc queries run against a partition.
For most queries the top ranked document was retrieved in less
than two seconds. The few outliers were all produced by the same
queries (#103, #136, #138, #144). In general, the response times
become shorter if a partition contains fewer documents and if the
document descriptions consist of fewer postings on average (see
Figure 5).

Figure 5: Number of postings per partition.

5 Conclusions

In our approach we stressed the update efficiency. We have shown
that the retrieval effectiveness does not have to be sacrificed to
achieve a high update efficiency when coping with highly dynamic
document collections. Our approach could probably be further
improved by finding a weighting scheme that, on the one hand,
achieves a very good retrieval effectiveness and that, on the
other hand, can be approximated by frequency independent weights
with only little variation from the exact weights. The retrieval
efficiency could be improved by better partitioning the document
collections according to the lengths of the documents. Our
approach seems to be very amenable to parallel processing. We may
think of several configurations (partitioning the query,
partitioning the document collection, etc.). At this moment, it is
not clear which configuration is appropriate for which
requirements. Furthermore, we do not know yet how dynamic the
document collection must be such that our access structure
outperforms inverted files.
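
1. Exact and Approximate Retrieval Status Values

\[ \text{"exact"}: \quad RSV(q,d_j) := \frac{1}{|d_j|} * \sum_{\varphi_i \in q \cap d_j} a_{ij} * b_i \]

\[ \text{"approx"}: \quad RSV^0(q,d_j) := \frac{1}{|d_j|} * \sum_{\varphi_i \in q,\ \sigma(d_j)[h(\varphi_i)] = 1} a^0_{ij} * b_i \]

2. Document Length

\[ |d_j| := \sqrt{\sum_{\varphi_i \in d_j} a_{ij}^2} \]

3. Weights of Query Features

\[ \text{"ntn"}: \quad b_i := ff(\varphi_i,q) * nidf(\varphi_i) \]

\[ \text{"ltn"}: \quad b_i := (1 + \log(ff(\varphi_i,q))) * nidf(\varphi_i) \]

4. Exact and Approximate Weights of Document Features

\[ \text{"ntc"}: \quad a_{ij} := \frac{ff(\varphi_i,d_j)}{\max\{ff(\varphi_k,d_j) \mid \varphi_k \in d_j\}} * nidf(\varphi_i), \qquad a^0_{ij} := 1 * nidf(\varphi_i) \]

\[ \text{"lnc"}: \quad a_{ij} := \frac{1 + \log(ff(\varphi_i,d_j))}{1 + \log(\max\{ff(\varphi_k,d_j) \mid \varphi_k \in d_j\})}, \qquad a^0_{ij} := 1 \]

\[ \text{"ltc"}: \quad a_{ij} := \frac{1 + \log(ff(\varphi_i,d_j))}{1 + \log(\max\{ff(\varphi_k,d_j) \mid \varphi_k \in d_j\})} * nidf(\varphi_i), \qquad a^0_{ij} := 1 * nidf(\varphi_i) \]

5. Normalized Inverse Document Frequency

\[ nidf(\varphi_i) := 1 - \frac{\log(1 + df(\varphi_i))}{\log(1 + n)} \]

Table 1: Classification of weighting schemes.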
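
The weighting schemes of Table 1 can be sketched in a few lines of Python. The formulas below are a reading of the table, not a verified transcription: in particular, the numerator of nidf is assumed to be log(1 + df), and the function names are hypothetical. The document weights are returned before the division by the document length.

```python
import math


def nidf(df, n):
    """Normalized inverse document frequency (assumed form): 1 for df = 0, 0 for df = n."""
    return 1.0 - math.log(1 + df) / math.log(1 + n)


def query_weight(scheme, ff, nidf_i):
    """Query weight b_i for the schemes "ntn" (linear) and "ltn" (logarithmic)."""
    if scheme not in ("ntn", "ltn"):
        raise ValueError(f"unknown query scheme: {scheme}")
    tf = ff if scheme == "ntn" else 1.0 + math.log(ff)
    return tf * nidf_i


def doc_weights(scheme, ff, max_ff, nidf_i):
    """Return (exact a_ij, approximate a0_ij) for "ntc", "lnc" or "ltc"."""
    if scheme not in ("ntc", "lnc", "ltc"):
        raise ValueError(f"unknown document scheme: {scheme}")
    if scheme[0] == "n":
        tf = ff / max_ff  # normalized linear feature frequency
    else:
        tf = (1.0 + math.log(ff)) / (1.0 + math.log(max_ff))  # normalized logarithmic
    idf = nidf_i if scheme[1] == "t" else 1.0
    # tf is at most 1, so the frequency independent weight idf is an upper bound
    return tf * idf, idf
```

In every scheme the approximate weight is obtained by replacing the normalized frequency factor with its maximum value 1, which is why it never depends on the feature frequencies.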

Abstract

Most text retrieval and filtering systems depend heavily on the
accuracy of the text they process. In other words, the various
mechanisms that they use depend on every word in the text and in
the queries being correctly and completely spelled. To get around
this limitation, our experimental text filtering system uses
N-gram-based matching for document retrieval and routing tasks.
The system's first application was for the TREC-2 retrieval and
routing task. Its performance on this task was promising, pointing
the way for several types of enhancements, both for speed and
effectiveness.

1.0  Background 

Even though modern character recognition techniques are steadily
improving, scanning and recognizing characters in paper documents
still results in many errors. These errors can impact the ability
of a standard, classical text retrieval system to process the
resulting ASCII text by interfering with word stemming, stopword
identification, or even basic indexing. Furthermore, as paper
documents age, their print quality can degrade, possibly causing
further recognition problems. This is becoming a serious problem,
especially given the vast amount of material that exists in a
paper form that has a limited lifetime.

Another difficulty is that the texts themselves may contain
spelling variations (e.g., British vs. American) or outright
spelling errors. The same is true of the queries entered by a
human user. Discrepancies between the spelling of a word in a
document and a word in a query may prevent a match.

A third problem area is that current word stemming and stopword
list construction algorithms sometimes involve a significant
amount of knowledge about the documents' language. Although there
are automatic approaches to acquiring this kind of knowledge, it
is complicated [1]. An ideal text retrieval method would require
no special language knowledge, and in fact would be able to handle
multiple languages (represented, say, in Unicode) simultaneously.


All of these problems have analogs in a somewhat different problem
domain, namely, reading and interpreting postal addresses on
pieces of mail. For the last eight years, the Environmental
Research Institute of Michigan (ERIM) has conducted research for
the United States Postal Service (USPS) in automatic address
interpretation. The ultimate goal of this research is to build a
high-performance address interpretation unit that can reliably and
efficiently read the address information on a mailpiece, and
determine its correct 9-digit or 11-digit ZIP code, even if that
code is not present on the mailpiece or is in error.

There are two parts to interpreting an address from a mailpiece,
and both of them are difficult:

• Recognizing the characters in the address block. This is
difficult because of poor print quality, poor contrast,
complicated backgrounds, presence of logos and non-address
information, use of proportional fonts, and a host of other
problems.

• Interpreting the lines of the address. This is difficult because
of spelling errors, addressing errors, unusual abbreviations,
strange local addressing conventions, and errors and omissions in
the USPS databases.

Our current address interpretation systems use a number of
different techniques to get around these problems. Among these is
the use of N-gram-based matching to find matches in postal
databases. (Please see [2] and [3] for details of these systems.)

In this work, N-grams are N-character slices of some longer
string. We do not use positional N-grams; i.e., what is important
is whether a given string contains a particular N-gram, not where
it falls in the string. We use bi-grams (N=2) and tri-grams (N=3)
together in the same system. Bi-grams and tri-grams complement
each other in the sense that bi-grams provide somewhat better
matching for individual words, while tri-grams provide the
connections between words to improve phrase matching.
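
A minimal sketch of non-positional bi-gram and tri-gram matching is given below. The case folding, the whitespace normalization, and the scoring function (fraction of the query's N-grams found in the candidate) are assumptions made for illustration; the text does not specify how the system scores a match.

```python
def ngrams(text, n):
    """Return the set of non-positional character n-grams of text."""
    s = " ".join(text.upper().split())  # fold case, collapse whitespace
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def profile(text):
    """Bi-grams and tri-grams together, as one set."""
    return ngrams(text, 2) | ngrams(text, 3)


def match_score(query, candidate):
    """Fraction of the query's n-grams present in the candidate (illustrative)."""
    q = profile(query)
    if not q:
        return 0.0
    return len(q & profile(candidate)) / len(q)
```

Because N-grams are taken across the blank between words, a tri-gram such as "T D" in "budget deficit" ties adjacent words together, and a single misspelled character only invalidates the few N-grams that contain it.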

N-gram-based matching addresses the difficulties mentioned above
for postal addresses in ways that have application for document
retrieval:

• It usually finds good matches for postal addresses even in the
face of a significant number of individual character recognition
errors. This is particularly true if the system can use redundant
data for the search key, such as the city, state, and ZIP
together. This is identical to the problem of matching document
text that was scanned from poor quality images. Words that occur
near one another in a text also "reinforce" each other for
searching the way that city, state, and ZIP do.

• It easily overcomes problems with spelling and addressing
variations and errors on mailpieces or in postal databases. This
is analogous to the problem of matching document text that
contains spelling variations. It also achieves an effect similar
to stemming in that different words having the same root are
considered to be similar.

• It can also handle abbreviations fairly easily. This is also
similar to the problem of word stemming.

• Some postal address interpretation systems depend heavily on the
system's ability to recognize certain key words, such as STREET or
BOX. These words play a role similar to stopwords in that they
carry little content, and mostly provide structural information.
N-gram-based matching can handle errors even in these words.

Drawing on the experience of using N-grams in postal work, ERIM
has been adapting these techniques for use in full-text retrieval
and filtering. We currently have a working full-text retrieval
system, called zview, which provides a simple query facility for
accessing a wide variety of ASCII documents, including
biographies, abstracts, and documentation. The zview system's
N-gram-based matching provides a very simple, easy-to-use text
retrieval system. It is tolerant of spelling variations and errors
in both the queries and the documents. It also provides a
straightforward ranking of documents that enables the user to see
which documents best match the query. This system has given us
some confidence in the wider utility of N-gram-based matching for
information retrieval. Although the TREC task would not seem to
directly require inexact matching (the documents were high quality
ASCII), our experience with zview indicates that this kind of
tolerance for variations in user input is very useful for general
retrieval.

Incidentally, one criticism that one can make of N-gram-based
systems is that they require large amounts of storage. Although
this is true of the naive implementation, there are a number of
variations using various superimposed coding techniques that
provide substantial space savings without impacting matching
performance. See [2] for details.

2.0  System  Description 

The zview system mentioned above has two parts: an off-line index
builder, and an on-line retrieval interface. In its current form,
zview does not yet use our most compact N-gram representation
techniques, and is thus not well-suited for very large (i.e.,
gigabyte size) databases. Also, it can handle only single query
strings, not the multiple query strings necessary for TREC.
Instead, for this task we opted to use a simple filter
architecture, and concentrated on making it as fast as possible.
This architecture has several advantages:

• It requires no disk space for an index. In fact, we were able to
save even more disk space by leaving the original documents in
their compressed form, and uncompressing them on the fly.

• It is conceptually simple, requiring only a very modest amount
of code (987 lines of C source). The whole system took only a few
days to design and build.

This architecture is similar in concept to that used by Mark
Zimmerman's PARA system, which participated in last year's TREC.
That system used a set of handcrafted Unix awk scripts to filter
the documents. The PARA system, although it got good results, took
22 days on a more or less dedicated NeXT machine, even though its
match criteria generally specified fewer match terms than our
system did. Also, Zimmerman's system used the notion of proximity
to identify sets of matching lines that occurred near one another,
whereas our system counts matches throughout the document without
regard to the proximity of match terms. See [4] for details.

Obviously,  a  significant  disadvantage  of  such  a  filter-type 
architecture  is  that  the  system  has  to  read  all  of  the  data  to 
simulate  a  retrieval.  This  is  very  computationally  intensive, 
but  we  were  able  to  mitigate  this  somewhat  in  our  system  by 
handling  much  of  the  actual  N-gram-based  matching  with  a 
highly parallel bit-vector approach, which we describe
below.  Also,  we  were  able  to  split  the  processing  load  up 
among  a  suite  of  computers  and  run  them  all  in  a  coarse- 
grained parallel  fashion. 

Figure  1  gives  an  overview  of  our  system  in  dataflow  fash- 
ion. We  discuss  each  of  the  processes  in  this  diagram  in  the 
following sections.

Figure 1: Dataflow Diagram for N-Gram-Based Text Retrieval


2.1  Generating  Query  Strings  From  TREC 
Topics 

The process labelled "Generate Queries" represents a program,
gen_query, which takes the TREC topics file and generates
a set of query strings for each topic. We considered
several different schemes for extracting query strings from
the topics, but finally settled on using just the phrases in the
concept and nationality sections. The following is a typical
topic from TREC-1:

<num>  Number: 007
<title>  Topic: U.S. Budget Deficit
<desc>  Description: Document will mention a proposal to decrease the U.S. budget deficit.
...
<con>  Concept(s):
1. U.S. budget deficit, federal budget shortfall
2. foreign affairs budget, defense budget, entitlements
3. increased revenues, tax increase, tax reform, auction quota
4. reduction in expenditures, spending cuts, cutting domestic programs, eliminating government subsidies
5. NOT financing the U.S. budget deficit
<nat>  Nationality: U.S.


Given  this  topic,  the  gen_query  program  generates  the  fol- 
lowing set  of  query  strings: 

007  000  U.S.
007  000  U.S. budget deficit
007  001  federal budget shortfall
007  002  foreign affairs budget
007  003  defense budget
007  004  entitlements
007  005  increased revenues
007  006  tax increase
007  007  tax reform
007  008  auction quota
007  009  reduction in expenditures
007  010  spending cuts
007  011  cutting domestic programs
007  012  eliminating government subsidies
007  013  NOT financing the U.S. budget deficit

Each line of the query output corresponds to one nationality
or  concept  string  from  the  topic.  The  nationality  strings  are 
listed  first,  followed  by  the  concept  strings  in  the  order  they 
appeared.  There  is  also  a  rank  number  associated  with  each 
query  string.  Generally,  as  concept  strings  were  used  in  the 
topics,  strings  that  appeared  earlier  in  the  list  were  more 
important  than  strings  that  appeared  later.  The  actual  filter- 
ing program  also  uses  these  rank  numbers  to  weight  the 
query  strings  in  importance.  Each  string  receives  a  weight 
equal  to  2k  -  r,  where  k  is  the  number  of  query  strings  for 
the  topic  and  r  is  the  string's  rank  value.  Any  nationality  or 
concept  string  can  also  have  a  NOT  prefix,  which  triggers  a 
slightly  different  kind  of  processing  for  these  strings, 
described  below. 
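
The rank-based weighting can be illustrated with a short sketch. It is not taken from gen_query: the function names `concept_strings` and `rank_weights` are hypothetical, the sketch assumes that concept strings are separated by commas, and it simply assigns ranks 0, 1, 2, ... in order of appearance (the sample output above gives the nationality string and the first concept string the same rank, which this sketch does not reproduce).

```python
import re

def concept_strings(concept_lines):
    """Split numbered concept lines into individual phrases, in order."""
    phrases = []
    for line in concept_lines:
        body = re.sub(r"^\s*\d+\s*\.\s*", "", line)  # drop a leading "1." marker
        phrases.extend(p.strip() for p in body.split(",") if p.strip())
    return phrases

def rank_weights(phrases):
    """Pair each phrase with the weight 2k - r (k strings, rank r counted from 0)."""
    k = len(phrases)
    return [(phrase, 2 * k - rank) for rank, phrase in enumerate(phrases)]

lines = [
    "1. U.S. budget deficit, federal budget shortfall",
    "2. foreign affairs budget, defense budget, entitlements",
    "5. NOT financing the U.S. budget deficit",
]
for phrase, weight in rank_weights(concept_strings(lines)):
    print(weight, phrase)
```

Under these assumptions the weights for k strings run from 2k down to k + 1, so the last string in a topic still carries more than half the weight of the first.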

2.2  The  Problem  of  Finding  Matches  While 
Filtering  Text 

In  Figure  1  above,  the  process  labelled  "zcat"  represents  a 
Unix  utility  program  that  takes  a  compressed  file  and 
returns  its  uncompressed  form.  The  process  labelled  "Find 
Matches" represents a program, find_matches, which takes a
set  of  queries  for  all  of  the  topics  at  once,  and  applies  them 
to  a  single  uncompressed  text  file.  Given  the  filter  architec- 
ture for  this  system,  it  was  important  for  us  to  minimize  the 
total  amount  of  I/O.  By  letting  find_matches  process  the 
queries  for  all  topics  simultaneously,  the  system  has  to  pass 
the  data  only  once. 

To make this process as efficient as possible, find_matches
does  most  of  its  matching  using  bit-vectors  representing  the 
presence  or  absence  of  particular  N-grams.  Suppose  that  we 
had  a  string  that  we  wanted  to  search  for  in  a  document 
using  N-grams.  An  obvious  way  to  do  it  would  be  to  per- 
form the  following  steps: 

•  Break  the  query  string  up  into  N-grams. 

•  For  every  line  in  the  document,  count  how  many  of 
those  N-grams  occurred  in  the  line.  This  count  would  be 
the  match  score  for  that  line. 

•  Keep a sorted list of the top-scoring lines, bumping the
lowest  scoring  entry  on  this  list  when  a  new  line  has  a 
better  score. 

When  this  algorithm  finishes,  we  will  have  a  list  of  the  top 
scoring  matches  for  our  query  string.  We  could  make  this 
process  a  little  more  sophisticated  by  putting  a  threshold  of 
some  sort  on  it.  If  a  line's  match  score  was  below  this 
threshold,  say  80%  of  the  number  of  N-grams  in  the  query 
string,  we  could  ignore  this  line  as  simply  not  containing 
enough  of  the  query  string  to  consider. 


Unfortunately,  this  naive  implementation  of  N-gram-based 
matching,  which  uses  nested  loops  to  compare  every 
N-gram  in  the  query  string  with  every  N-gram  in  a  line,  is 
computationally  very  costly. 
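
The naive procedure can be sketched as follows. This is an illustration rather than the code of find_matches: the helper names are hypothetical, the example uses bi-grams only, and it assumes that text is upper-cased and padded with one leading and one trailing space before N-grams are taken.

```python
import heapq

def ngrams(text, n=2):
    """N-grams of the upper-cased text, padded with a leading and trailing space."""
    padded = " " + " ".join(text.upper().split()) + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def naive_score(query, line, n=2):
    """Count the distinct query N-grams that occur somewhere in the line."""
    line_grams = ngrams(line, n)
    score = 0
    for q in set(ngrams(query, n)):
        for g in line_grams:      # nested loops: every query N-gram vs. every line N-gram
            if q == g:
                score += 1
                break
    return score

def top_lines(query, lines, keep=3, threshold=0.8, n=2):
    """Return up to `keep` (score, line_number) pairs that reach the threshold."""
    needed = threshold * len(set(ngrams(query, n)))
    scored = ((naive_score(query, line, n), i) for i, line in enumerate(lines))
    return heapq.nlargest(keep, (pair for pair in scored if pair[0] >= needed))

lines = ["the budget deficit grew", "weather report", "budget defcit talks"]
print(top_lines("budget deficit", lines))
```

The misspelled third line still passes the 80% threshold, because only the two bi-grams that touch the missing letter are lost. The cost of the inner loop grows with the product of the query length and the line length, which is the inefficiency noted above.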

2.3  Matching  and  Scoring  Lines 

We  can  do  significantly  better  than  the  naive  N-gram-based 
algorithm  described  above  by  using  hashing.  Figure  2  illus- 
trates this  process  for  bi-grams.  The  key  idea  here  is  to 
break  the  matching  process  into  two  parts: 

•  Initialize  a  hash-code-based  look-up  table  for  all  of  the 
N-grams  in  the  query  string.  Each  slot  simply  records 
whether  any  N-gram  hashed  to  that  position  (1)  or  not 
(0).  Thus  the  hash  table  is  a  simple  bit  vector. 

•  For each line in the document, make a single pass for all
of  the  line's  N-grams,  checking  each  one  to  see  if  it  is  in 
the  table.  In  other  words,  see  if  the  slot  (bit)  correspond- 
ing to  the  N-gram  contains  a  1.  If  it  does,  add  1  to  the 
line's  score. 

If  we  make  the  hash  table  big  enough,  we  can  make  the 
probability  of  collisions  arbitrarily  small.  For  example,  con- 
sider that  there  are  only  52,022  possible  bi-grams  and  tri- 
grams  for  the  alphabet  consisting  of  A-Z,  0-9,  and  space.  A 
hash  table  size  of,  say,  20,000  slots  (625  32-bit  words  or 
2500  bytes)  provides  far  more  than  ample  space  to  avoid 
collisions.  If  there  is  a  collision  between  the  hash  of  an 
N-gram  in  the  query  string  and  that  of  some  different 
N-gram  in  the  line,  its  only  effect  is  to  cause  that  line's  score 
to  be  one  higher  than  it  should  be.  It  is  even  more  unlikely 
that  two  or  more  of  the  N-grams  would  also  collide.  Thus  by 
allowing  collisions  in  the  hash  table,  we  get  a  simple,  quick 
algorithm  at  the  cost  of  a  small  probability  of  small  distor- 
tions in  the  line  scores.  See  [2]  for  a  further  discussion  of 
how  to  estimate  the  probability  of  these  kinds  of  collisions. 
Note  that  we  include  leading  and  trailing  spaces  for  words 
in counting N-grams. Also recall that in our full implementation
of this we process tri-grams the same way and at the
same  time.  The  match  score  between  two  strings  includes 
the  count  of  both  matching  bi-grams  and  matching  tri- 
grams. 
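
A minimal sketch of the single-query bit-vector table follows. The hash function `slot` is a hypothetical stand-in (the hash actually used is not described here): it numbers each N-gram in base 37 over the alphabet A-Z, 0-9, and space, and folds the result into a 20,000-bit table. Mapping every other character to a space is also an assumption of the sketch.

```python
import re

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 "  # 37 symbols
TABLE_BITS = 20000                                   # 2,500 bytes

def ngrams(text, sizes=(2, 3)):
    """Bi-grams and tri-grams of the normalized text, with padding spaces."""
    padded = " " + re.sub(r"[^A-Z0-9]+", " ", text.upper()).strip() + " "
    return [padded[i:i + n] for n in sizes for i in range(len(padded) - n + 1)]

def slot(gram):
    """Hypothetical hash: number the N-gram in base 37, then fold into the table."""
    code = 0
    for ch in gram:
        code = code * 37 + ALPHABET.index(ch)
    if len(gram) == 3:
        code += 37 ** 2           # keep tri-gram codes clear of bi-gram codes
    return code % TABLE_BITS

def build_table(query):
    """Set one bit for every N-gram of the query string."""
    table = bytearray(TABLE_BITS // 8)
    for gram in ngrams(query):
        s = slot(gram)
        table[s >> 3] |= 1 << (s & 7)
    return table

def score_line(table, line):
    """Single pass over the line's N-grams: add 1 for each one whose bit is set."""
    score = 0
    for gram in ngrams(line):
        s = slot(gram)
        if table[s >> 3] & (1 << (s & 7)):
            score += 1
    return score

print(score_line(build_table("STRING"), "SPRUNG"))
```

For "STRING" against "SPRUNG" the shared N-grams are the bi-grams " S", "NG", and "G " and the tri-gram "NG ". The count of 52,022 quoted above is 37^2 + 37^3 = 1,369 + 50,653.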

Figure 2: Using Bi-Grams to Compare "STRING" and "SPRUNG"

To get an additional gain in efficiency, we can carry this
hash table scheme for counting N-grams another step by
using parallelism to match a single line from a document
against a number of query strings all at the same time. Figure 3
illustrates this scheme. Again, matching is a two-step
process:


•  Initialize  a  hash  table  for  the  entire  set  of  query  strings. 
Each  slot  in  the  hash  table  again  represents  one  N-gram, 
but  it  is  now  a  bit  vector  in  which  the  kth  bit  indicates 
whether  the  N-gram  occurs  in  the  kth  query  string. 

•  For  every  line  in  the  document,  hash  each  N-gram  to 
find  the  corresponding  bit-vector.  Then  add  that  bit  vec- 
tor to  the  current  score  vector  for  the  document.  The  kth 
element  of  this  score  vector  keeps  the  current  match 
score  for  the  document  against  the  kth  query  string. 

One  other  significant  aspect  of  this  multi-query-string 
implementation is that we also represent the final score vector
for all of the queries as a set of parallel bit-vectors. This
allows us to perform 32 additions in parallel by
adding one 32-bit word of an N-gram bit-vector from the
table to one 32-bit word representing the 1-bits of 32 query
string scores, and then propagating the carry bits up to bit-vectors
representing the higher order bits of the scores. Effectively,
this lets us compute scores for 32 query strings in
close  to  the  same  amount  of  time  as  it  takes  to  compute  the 
score  for  a  single  query  string. 
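
The parallel addition can be sketched with bit planes, where plane i holds bit i of every query string's score. This is an illustration under stated assumptions, not the original C code: a Python dictionary stands in for the hash table, Python integers stand in for 32-bit words, and the function names and the choice of eight planes are hypothetical.

```python
def ngrams(text, sizes=(2, 3)):
    """Bi-grams and tri-grams of the upper-cased text, with padding spaces."""
    padded = " " + " ".join(text.upper().split()) + " "
    return [padded[i:i + n] for n in sizes for i in range(len(padded) - n + 1)]

def build_masks(queries):
    """Map each N-gram to a word whose k-th bit is set if query k contains it."""
    masks = {}
    for k, query in enumerate(queries):
        for gram in ngrams(query):
            masks[gram] = masks.get(gram, 0) | (1 << k)
    return masks

def add_word(planes, word):
    """Add 1 to every counter whose bit is set in `word`, rippling the carries up."""
    carry = word
    for i in range(len(planes)):
        planes[i], carry = planes[i] ^ carry, planes[i] & carry
        if not carry:
            break                 # no counter overflowed into the next plane

def score_line(masks, line, num_queries, num_planes=8):
    """Score one line against all queries at once; counters wrap at 2**num_planes."""
    planes = [0] * num_planes
    for gram in ngrams(line):
        word = masks.get(gram, 0)
        if word:
            add_word(planes, word)
    # Reassemble the k-th score from bit k of each plane.
    return [sum(((planes[i] >> k) & 1) << i for i in range(num_planes))
            for k in range(num_queries)]

masks = build_masks(["STRING", "SPRING", "budget"])
print(score_line(masks, "SPRUNG", 3))
```

Each call to `add_word` touches only as many planes as the longest carry chain requires, which is why the cost per N-gram stays close to that of a single counter.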


2.4  Scoring  and  Ranking  Documents 

After  the  system  computes  the  N-gram  match  scores  of  a 
single line against all of the query strings, it applies a
threshold  criterion.  If  the  score  for  a  particular  query  string 
is  below  its  threshold,  the  system  treats  it  as  a  zero  score. 
The  threshold  is  defined  as  a  percentage  of  the  maximum 
possible  N-gram  score  for  the  query  string.  For  example,  if 
a  particular  query  string  has  a  maximum  possible  score  of 
20  matching  N-grams,  and  the  threshold  percentage  (as 
determined  by  a  command  line  argument  to  the  program)  is 
70%,  then  the  system  will  ignore  any  line  whose  score  is 
less than 14. There is a similar command line argument
which  determines  the  threshold  percentage  for  negation 
query  strings  (those  tagged  with  a  NOT  prefix  in  the  query 
string  set).  If  a  document  has  a  line  whose  score  for  a  nega- 
tion query  string  exceeds  its  threshold,  the  system  will  dis- 
card the  document.  For  most  runs  of  the  system,  we  got  the 
best  results  with  a  match  threshold  at  70%  and  a  negation 
threshold at 95%. In other words, we were willing to accept
a line as a match for a given query string if it had 70% of the
requisite N-grams, but needed a nearly exact match in order
to discard a document for matching a negation query.

Figure 3: Parallel Implementation for Multiple Query Strings

To  compute  the  scores  for  a  whole  document,  we  accumu- 
lated the  line  scores  for  each  query  string,  but  applied  a  cap 
at  twice  the  maximum  possible  N-gram  score.  In  other 
words,  if  a  document  mentions  a  particular  query  string 
twice,  that  is  twice  as  good  as  mentioning  it  once.  However 
mentioning  it  three  or  more  times  is  of  no  additional  value. 
We  put  this  cap  into  the  system  after  observing  that  initial 
versions  of  the  system  would  overrate  some  documents 
because  they  mentioned  some  low  ranking  query  string  for  a 
given  topic  many  times,  but  completely  missed  any  of  the 
higher  ranking  query  strings  for  the  topic. 

To  generate  the  final  scores  for  a  document  against  a  topic, 
the  system  accumulates  the  scores  for  each  query  string  in 
the  topic,  multiplying  them  by  the  weighting  factor  com- 
puted earlier.  If  this  aggregate  weighted  score  exceeds  a 
hard-coded  threshold  of  40,  the  system  then  writes  out  this 
score  as  a  relevance  judgement  for  the  document.  Later  a 
separate  program  reads  the  file  containing  the  relevance 
scores,  computes  the  top  1000  scores  for  each  topic,  and 
writes  them  out  as  the  final  system  result. 
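
The thresholding, capping, and weighting steps of this section can be put together in a short sketch. The names `QueryString` and `document_score` are hypothetical, and two details are assumptions: a score exactly at the threshold counts as a match, and the same rule is used for negation strings.

```python
from typing import NamedTuple, Optional, Sequence

class QueryString(NamedTuple):
    max_score: int         # maximum possible N-gram score for the string
    weight: int            # rank-based weight of the string
    negated: bool = False  # True for strings with a NOT prefix

def document_score(line_scores: Sequence[Sequence[int]],
                   queries: Sequence[QueryString],
                   match_pct: int = 70, negate_pct: int = 95,
                   cutoff: int = 40) -> Optional[int]:
    """Return the weighted document score, or None if the document is dropped."""
    totals = [0] * len(queries)
    for scores in line_scores:                 # one list of scores per line
        for k, (score, q) in enumerate(zip(scores, queries)):
            pct = negate_pct if q.negated else match_pct
            if 100 * score < pct * q.max_score:
                continue                       # below threshold: treated as zero
            if q.negated:
                return None                    # negation string matched: discard
            # Cap the accumulated score at twice the maximum N-gram score.
            totals[k] = min(totals[k] + score, 2 * q.max_score)
    total = sum(t * q.weight for t, q in zip(totals, queries))
    return total if total > cutoff else None

queries = [QueryString(20, 6), QueryString(20, 5, negated=True)]
print(document_score([[14, 0], [20, 0], [20, 0]], queries))
```

In the example the first query string matches on three lines, but the cap limits its contribution to 40, giving a weighted score of 240.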


3.0  Multi-Processor  Execution 

Even  with  all  of  this  attention  to  efficient  implementation, 
the system still required a large amount of computation.
Fortunately,  ERIM  had  available  a  network  of  Sun  worksta- 
tions over  which  we  could  distribute  the  processing  load. 
This  network  consisted  of  two  Sun  SPARCstations  and  five 
or  six  Sun  SPARCstation  2  machines.  We  used  a  very  sim- 
ple partitioning  scheme,  assigning  each  machine  a  fixed  set 
of documents to evaluate against all of the topics. Given this
arrangement, a full run of topic set 3 against Disks 1 and 2
took 12 to 13 hours if all of the machines were fully available.
When there was significant contention on some of the
machines, the run might take as much as 24 hours. An obvious
drawback with using the fixed partitioning scheme was
that  often  one  or  more  of  the  machines  would  not  finish 
until  many  hours  after  all  of  the  others  had.  This  was  usually 
due  to  unexpected  heavy  contention  on  one  or  more 
machines.  If  we  had  used  a  more  intelligent  and  dynamic 
document  file  assignment  mechanism  (implemented  in,  say, 
the  Linda  coordination  language),  we  would  have  had  better 
elapsed  execution  times. 
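
The effect of the assignment policy can be seen in a toy simulation. It does not model Linda or the actual ERIM network: the file costs, the machine speeds, and both function names are hypothetical, and a machine under contention is represented simply by a lower speed.

```python
import heapq

def static_makespan(file_costs, speeds):
    """Files are dealt out round-robin in advance; return the finishing time."""
    loads = [0.0] * len(speeds)
    for i, cost in enumerate(file_costs):
        loads[i % len(speeds)] += cost
    return max(load / speed for load, speed in zip(loads, speeds))

def dynamic_makespan(file_costs, speeds):
    """Each machine takes the next unprocessed file as soon as it is idle."""
    idle_at = [(0.0, m) for m in range(len(speeds))]
    heapq.heapify(idle_at)
    finish = 0.0
    for cost in file_costs:
        start, m = heapq.heappop(idle_at)      # earliest available machine
        done = start + cost / speeds[m]
        finish = max(finish, done)
        heapq.heappush(idle_at, (done, m))
    return finish

files = [1.0] * 8
speeds = [1.0, 1.0, 1.0, 0.25]   # the last machine is under heavy contention
print(static_makespan(files, speeds), dynamic_makespan(files, speeds))
```

In this example the fixed partition finishes only when the slow machine has worked through its two files, while the dynamic scheme gives that machine a single file and finishes in half the time.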
TABLE 1. Selected TREC-2 Test Results

Test      Note                              Test      Query    Negation  Average    At 10   At 100  R-
Number                                      Type      Thresh.  Thresh.   Precision  Docs    Docs    Precision
erima1    official                          ad hoc    80       80        0.1589     0.3960  0.3016  0.2179
erima2    official                          ad hoc    75       80        0.1885     0.4480  0.3426  0.2494
a30       official                          ad hoc    70       95        0.1604     0.4040  0.3038  0.2198
a31       official w/ X queries             ad hoc    70       95        0.1153     0.4000  0.259   0.1904
a33       spanning lines                    ad hoc    70       95        0.0734     0.15    0.1484  0.1333
a34       span lines; exp. decay weighting  ad hoc    90       95        0.1099     0.314   0.2276  0.1776
a35       span lines; exp. decay weighting  ad hoc    80       95        0.1336     0.3000  0.2442  0.2004
a35-AP    AP only                           ad hoc    80       95        0.2717     0.4740  0.2838  0.3140
a36       span lines; exp. decay weighting  ad hoc    70       95        0.0792     0.1780  0.1678  0.1387
erimr1    official                          routing   80       80        0.1219     0.3580  0.2304  0.1814
erimr2    official                          routing   75       80        0.1415     0.4240  0.2524  0.2031
r5        official                          routing   70       95        0.1225     0.3600  0.2310  0.1818
r6        span lines; exp. decay weighting  routing   80       95        0.1015     0.2880  0.1954  0.1604
r7        span lines; exp. decay weighting  routing   70       95        0.0602     0.2200  0.1388  0.1123

4.0  Results 

To  test  our  system,  we  ran  it  a  number  of  times,  varying  dif- 
ferent system  parameters.  After  the  conference,  we  were 
also  able  to  make  a  few  changes  and  run  it  again.  Table  1 
above  summarizes  some  highlights  of  our  results,  which 
show  a  number  of  interesting  points: 

•  The tests numbered erima1, erima2, erimr1, and erimr2
were the official results turned in on June 1. The first two
were  the  ad  hoc  results,  and  the  second  two  were  the 
routing  results. 

•  The tests numbered a30, a35, a36, r5, and r7 were all
attempts to determine good settings for the query threshold
and negation threshold parameters. Unfortunately,
the results from these test runs do not provide
anywhere near enough data to perform a complete sensitivity
analysis for these parameters. Also, we noticed in
some of our testing on the TREC-1 queries that there is
considerable difference in the optimum values of these
thresholds for different topic set/data set combinations.

•  One of the motivations for using N-gram-based matching
was that it provides good matching performance in
the face of textual errors. To test this idea, we ran test
a31 using deliberately damaged query strings. In this
test,  we  took  each  query  string  produced  by  gen_query, 
and  replaced  the  third  character  with  the  letter  "X".  This 
works  out  to  be  an  effective  character  recognition  error 
rate  of  4.5%  over  the  whole  body  of  query  strings. 
Although  the  system  took  a  considerable  hit  in  perfor- 
mance, the  interesting  thing  was  that  it  still  functioned  at 
all.  Many  of  the  other  systems  in  the  TREC  evaluation 
would  most  likely  have  completely  failed  in  an  analo- 
gous test,  since  they  depend  heavily  on  exact  word 
matches. 

•  One serious drawback to the original system was that it
did not span lines when matching. That is, if the text that
a particular query should match was split across two
document lines, then the match would fail. After the conference,
we found a relatively easy way to rectify this
design deficiency and were able to run tests a33 through
a36, and r6 and r7. Unfortunately, this change greatly
affected  the  system's  response  to  the  query  and  negation 
thresholds,  and  we  were  not  able  to  run  enough  tests  to 
find  the  optimum  values  for  these  parameters.  The  end 
result  is  that  the  best  results  we  have  for  this  improved 
system  are  still  not  as  good  as  the  official  results  we 
reported. 

•  Another  serious  problem  with  the  original  system  came 
to  light  during  the  official  conference.  As  Donna  Harman 
pointed  out  in  her  closing  talk,  no  one  has  yet  really 
explored  the  full  ramifications  of  changes  in  term 
weighting  strategies.  Our  original  system  used  a  very 
simple-minded linear-falloff weighting scheme. Our
assumption was that concept strings appeared in decreasing
order of importance. However, we began to suspect,
based  on  different  concepts  that  were  mentioned  in  a 
number  of  the  conference  presentations,  that  this  was 
overly  simplistic.  We  decided  to  implement  a  straight- 
forward exponential-decay  weighting  scheme.  In  this 
approach,  the  first  query  string  gets  a  weight  of  100,  the 
second gets a weight of, say, 90, the third a weight of 81,
the  fourth  a  weight  of  72,  and  so  on,  with  each  succeed- 
ing weight  taking  90%  of  its  predecessor's  value.  Unfor- 
tunately, we  did  not  have  time  to  tune  the  system's 
response  to  this  change  either,  and  its  results  (a34,  a35, 
a36,  r6,  and  r7)  are  worse  than  the  official  results  as 
well.  However,  it  appears  that  there  is  plenty  of  room  for 
experimentation  with  different  term  weighting  schemes 
and  we  will  continue  working  with  them. 

•  In  our  early  work  with  the  system  before  sending  in  the 
official  results,  we  did  a  fair  amount  of  testing  using  just 
the  Associated  Press  document  set.  We  were  perhaps 
misled  by  how  well  the  system  did  on  these  documents, 
and  missed  some  chances  to  improve  the  system  earlier. 
The  test  labelled  a35-AP  shows  what  the  results  look 
like  for  a35  if  we  restrict  the  new  system  to  just  return- 
ing AP  documents,  and  restrict  the  relevance  judgements 
to  just  AP  documents.  Even  with  this  imperfectly  tuned 
new  version,  we  see  that  the  system  is  capable  of  signifi- 
cantly better  performance.  It  is  unclear  why  there  should 
be  such  variation  between  the  retrievability  of  the  AP 
documents  and  the  other  document  collections. 
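
The damaged-query test (a31) can be illustrated with a small sketch. The helper names are hypothetical and the example uses bi-grams only; it shows why a single substituted character removes only a couple of N-grams from a query string, whereas an exact word match on the damaged word would fail outright.

```python
def bigrams(text):
    """Set of bi-grams of the upper-cased text, with padding spaces."""
    padded = " " + " ".join(text.upper().split()) + " "
    return {padded[i:i + 2] for i in range(len(padded) - 1)}

def damage(query):
    """Replace the third character with "X" (strings shorter than 3 are unchanged)."""
    return query if len(query) < 3 else query[:2] + "X" + query[3:]

def surviving_fraction(query, line):
    """Fraction of the damaged query's bi-grams that are still found in the line."""
    grams = bigrams(damage(query))
    return len(grams & bigrams(line)) / len(grams)

query = "budget deficit"
line = "the budget deficit grew"
print(damage(query), round(surviving_fraction(query, line), 2))
```

Here 12 of the 14 distinct bi-grams of the damaged string still occur in the line, which is enough to pass a 70% match threshold. One damaged character per string at an error rate of 4.5% corresponds to an average query string length of roughly 22 characters (1 / 0.045).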

At  its  best,  our  system  performed  as  well  as  most  of  the  sys- 
tems that  participated  in  TREC-1.  However,  there  is  ample 
room  for  improvement,  as  we  have  noted  above,  especially 
in  comparison  to  many  of  the  systems  that  came  back  for 
TREC-2. 
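
The two rank-weighting schemes discussed in the results above can be compared directly. The function names are hypothetical, and the sketch assumes that the decayed weights are integers truncated at each step, which reproduces the sequence 100, 90, 81, 72 quoted in the text.

```python
def linear_weights(k):
    """Linear falloff: weight 2k - r for rank r = 0 .. k-1."""
    return [2 * k - r for r in range(k)]

def decay_weights(k, first=100, percent=90):
    """Exponential decay: each weight is `percent`% of its predecessor."""
    weights, w = [], first
    for _ in range(k):
        weights.append(w)
        w = w * percent // 100     # integer truncation, e.g. 81 -> 72
    return weights

print(linear_weights(10))
print(decay_weights(10))
```

For a topic with ten query strings, the linear scheme leaves the last string with 55% of the first string's weight (11 versus 20), while this decay leaves it with 36% (36 versus 100).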


5.0  Further  Research 

The  TREC-2  task  is  the  first  real  application  for  our 
N-gram-based  multiple-query  system.  As  in  any  experiment 
of  this  nature,  the  results  and  problems  suggest  many  more 
possible  avenues  of  research.  These  ideas  fall  into  two  cate- 
gories. 

5.1  Analyzing  the  Current  System's 
Performance 

Further  analysis  of  the  existing  system  will  allow  us  to  bet- 
ter understand  its  behavior  and  limitations.  Some  ways  to  do 
that  include: 

•  It is likely that generating query strings only from the topic
concept strings significantly limited performance.
For example, Topic 74, about instances where the
U.S. government propounds conflicting policies, completely
failed to mention terms such as policy or regulation
in the concept list. Thus, our system had only a very
small chance of finding matching documents.
Zimmerman's filtering system [4] did well with
handcrafted  queries,  so  we  should  also  try  manually  gen- 
erated queries. 

•  Currently  the  system  has  a  hard-coded  cutoff  threshold 
of  40  for  the  weighted  aggregate  score.  The  purpose  of 
the  threshold  was  to  prevent  the  system  from  returning 
results that were guaranteed to be noise because of their
very low scores. This value was set more or less arbitrarily,
so we should experiment with changing this
threshold to determine its true effect. In all likelihood, it
could be a fair amount higher, preventing the system
from generating other useless low-scoring results.

•  Currently  the  system  sets  a  cap  of  three  times  the  maxi- 
mum N-gram  score  for  any  query  string  score.  Again, 
this  value  was  determined  only  by  a  very  rough  empiri- 
cal process,  so  we  should  experiment  with  changing  this 
cap,  to  see  how  much  impact  it  has. 

5.2  Extending  the  System 

We  can  also  make  some  significant  changes  to  the  system  to 
explore  possibilities  for  other  performance  improvements. 

•  Currently  the  system  treats  upper  and  lower  case  alike 
for  both  documents  and  queries.  Since  acronyms  and 
brand names sometimes have different meanings from
uncapitalized words having the same letters, perhaps
there is a way to take the case of letters into account
when computing a match. That is, we could count a
match between two words with different capitalizations
as a good match, but a match between two words with
the same capitalization as a better one.

•  Our  system  is  basically  just  a  text  filter.  As  such,  it  is 
already  close  to  being  useful  for  various  routing  tasks. 
However,  to  use  N-gram-based  matching  for  on-line 
retrieval,  we  will  have  to  implement  a  true  index-based 
system. Ideally, we would like to integrate the text
retrieval capability of our zview system with the flexibility
of the multi-query filtering system described here.
Although  we  have  an  initial  design  sketched  out,  it  will 
take  an  investment  of  further  time  and  machine 
resources  to  implement  this  idea  and  test  it.  We  will  also 
be using some of our new compact N-gram representation
techniques to reduce the large amount of index storage
and computation required.

Retrieval of Partial Documents


Alistair  Moffat*       Ron  Sacks-Davis^       Ross  Wilkinson*       Justin  Zobel^ 


Information  systems  usually  retrieve  whole  doc- 
uments as  answers  to  queries.  However,  it  may 
in  some  circumstances  be  more  appropriate  to  re- 
trieve parts  of  documents.  These  parts  could  be 
formed  by  arbitrary  division  of  running  text  into 
pieces  of  similar  length,  or  by  considering  the  doc- 
ument's hierarchical  structure.  Here  we  consider 
how  to  break  documents  into  parts,  how  to  imple- 
ment retrieval  of  parts,  and  the  impact  of  division 
of  documents  on  retrieval  effectiveness. 

1  Introduction 

Provision  of  answers  to  informally  phrased  ques- 
tions is  a  central  part  of  information  retrieval. 
These  answers  traditionally  take  the  form  of  doc- 
uments retrieved  from  a  text  database,  but  doc- 
uments will  often  be  unsatisfactory  as  answers. 
They  may  be  large  and  unwieldy;  the  answer  they 
represent  may  be  diffuse,  and  therefore  hard  for 
the  user  to  extract;  and  word-based  retrieval  sys- 
tems may  be  misled  by  the  breadth  of  vocabu- 
lary of  a  long  document  into  believing  it  to  be 
relevant. 

Indexing  and  returning  parts  of  documents  ad- 
dresses these  problems.  We  have  approached  the 
problem  of  partial  documents  in  two  ways.  The 
first  approach  is  to  regard  documents  as  an  un- 
structured series  of  "pages"  of  text  of  similar 
length,  each  of  which  can  be  returned  as  an  an- 
swer to  a  query.  We  would  expect,  under  this 
approach,  that  any  bias  in  the  retrieval  mecha- 
nism towards  documents  of  a  particular  length 
should  be  eliminated.  By  regarding  an  answer  to 
be  the  document  from  which  an  answer  page  is 
drawn, paging can be used even in contexts where
documents are required as answers, as is the case
for the TREC experiments.

Breaking  documents  into  pages,  however,  has 
implications  for  implementation:  the  growth  in 
the  number  of  candidate  answers  is  such  that 
current  approaches  for  evaluating  queries  have 
unacceptable  memory  requirements  and  response 
times.  We  have  developed  new  algorithms  for 
implementing  information  retrieval  methods  on 
large  collections,  concentrating  on  the  cosine 
measure  with  IDF  term  weights  as  a  typical  ex- 
ample. These  include  techniques  for  efficiently 
constructing  and  compressing  large  inverted  files, 
and  for  restricting  the  amount  of  memory  space 
and  processing  time  required  during  query  eval- 
uation. The  result  is  that  we  are  able  to  iden- 
tify answers  in  a  fraction  of  the  space  and  time 
required  by  previous  methods.  Using  these  tech- 
niques on  a  version  of  the  TREC  data  in  which 
the  documents  are  broken  into  over  1.7  million 
pages,  answers  can  be  found  more  quickly  than 
they  could  previously  be  found  on  the  unpaged 
data,  even  though  the  latter  has  a  smaller  index 
and  fewer  records.  Section  2  gives  an  overview  of 
these  techniques. 

Our  second  approach  to  the  problem  of  par- 
tial documents  was  to  regard  documents  as  hi- 
erarchical structures.  Some  of  the  documents  in 
TREC  are  very  large.  It  is  not  clear  that  it  is 
desirable  to  return  such  documents  as  a  whole, 
nor  is  it  clear  that  these  documents  should  be  in- 
dexed as  a  whole.  Most  of  the  long  documents 
in  TREC  are  from  the  Federal  Register  collection 
and  all  have  some  degree  of  structure  associated 
with  them.  We  conducted  a  set  of  experiments 
that  attempted  to  determine  whether  these  doc- 
uments should  be  indexed  as  single  objects,  and 
whether  the  documents'  structure  could  be  used 
in  conjunction  with  the  contents  of  its  elements. 

We  also  experimented  with  retrieval  of  par- 
tial documents  and  investigated  whether  context, 
that  is  the  rank  of  the  whole  document,  helped 
improve  ranking  of  sections.  The  experiments 
with  hierarchical  structures  are  necessarily  based 
on the small set of longer documents in the TREC
collection. We would expect that, in a large
database of such documents, the implementation
techniques we developed for unstructured text
could be used for structured documents almost
without change.

2    Retrieval of paged text

The first issue that must be considered when accessing
a set of documents as a collection of pages
is the pagination strategy; one possible method
for pagination is discussed in Section 2.1.

Then, given paged text, answers must be found
in response to queries. In an inverted file text
database system of the kind we have been developing
[6, 9], the implications of paging are a large
increase both in the number of accumulators used
to hold the intermediate cosine values and in the
length of the inverted file entries. The former
means that evaluating a ranking will require more
memory, while the latter implies a longer time to
resolve queries. Methods for avoiding these increases
are outlined in Sections 2.2 and 2.3; a full
description is given elsewhere [10].

Finally, there is the impact on effectiveness
of the decision to paginate the documents. Experimental
results comparing document retrieval
with page retrieval are described in Section 2.4.

2.1     Breaking text into pages

One way to store text in a database is as a set
of records, each containing a single document.
Such an organisation implies that entire documents
must be retrieved in response to queries.
Moreover, a query can match long documents in
which its words are widely separated and could
well be unrelated, so this storage strategy can result
in irrelevant material being retrieved.

An alternative way of presenting text is as a
series of pages. There are several advantages to
using pages:

•  they are of a more manageable size than
whole documents, and the size variation between
pages can be constrained to be far less
than the size variation between documents;

•  if a user looks for answers matching a query,
it is likely that in relevant material the
query's words will occur close together in the
retrieved text;

•  retrieving a page of text to display may be
considerably cheaper than retrieving an entire
document;

•  in many applications it is natural to regard
documents as consisting of parts rather than
as a whole; for example, the nodes in a hypertext
system or the sections of a book; and

•  only parts of documents can, in general, fit
on a screen, and people can only comprehend
part of a document at a time.

1.  Set prev ← undefined and curr ← w_1.

2.  For each j from 2 to n,

(a)  If curr > B, emit prev if it is defined,
set prev ← curr, and set curr ← w_j.

(b)  Otherwise, if w_j > prev then set
prev ← prev + curr and set curr ← w_j.

(c)  Otherwise, set curr ← curr + w_j.

3.  If curr < B then set prev ← prev + curr and
emit prev. Otherwise, emit prev followed by
curr.

Figure 1: Paging algorithm

For  these  reasons — and  because  of  the  challenge 
posed  by  the  sheer  size  of  a  paged  version  of  the 
TREC  collection — experiments  were  undertaken 
with  paged  text. 

The  paging  strategy  adopted  first  breaks  doc- 
uments into  minimal  units  likely  to  be  useful  for 
ranking.  In  most  collections  this  unit  would  prob- 
ably be  the  paragraph.  The  algorithm  shown  in 
Figure  1  is  then  used  to  gather  paragraphs  into 
pages,  where  Wi  is  the  length  of  the  i'th  para- 
graph of  the  original  document  and  B  is  the  tar- 
get length  of  each  constructed  page.  Documents 
of  fewer  than  B  bytes  must,  of  course,  be  allo- 
cated singleton  pages,  but  all  other  pages  are  B 
bytes  or  more.  Moreover,  the  aim  is  that  only 
a  small  number  of  pages  are  significantly  longer 
than  B  bytes.  The  algorithm  requires  time  linear 
in  the  length  of  the  text  and  a  constant  amount 
of  space,  and  so  is  a  small  additional  step  during 
database  construction.  For  the  experiments  de- 
scribed below,  length  was  measured  in  bytes  (the 
alternatives  being  words,  or  some  more  abstract 
length  value  based  upon  term  frequencies),  and 
B  =  1,000  was  used. 
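
Figure 1 is not reproduced in this text, so the following Python sketch shows only one greedy reading that satisfies the properties stated above: every page is at least B bytes unless the document itself is shorter, the text is read once, and only a bounded amount is buffered. The function name `paginate`, the held-back page, and the newline used to join paragraphs are assumptions of this sketch, not details taken from the figure.

```python
from typing import Iterable, Iterator, List, Optional


def paginate(paragraphs: Iterable[str], target: int = 1000) -> Iterator[str]:
    """Greedily gather paragraphs into pages of at least `target` bytes."""
    pending: List[str] = []      # paragraphs of the page being built
    pending_len = 0              # bytes gathered so far for that page
    held: Optional[str] = None   # last complete page, held so a short tail can join it
    for para in paragraphs:
        pending.append(para)
        pending_len += len(para.encode("utf-8"))
        if pending_len >= target:
            if held is not None:
                yield held
            held = "\n".join(pending)
            pending, pending_len = [], 0
    if pending:
        # A remainder shorter than `target` is merged into the previous page,
        # or becomes a singleton page if the whole document is short.
        tail = "\n".join(pending)
        held = tail if held is None else held + "\n" + tail
    if held is not None:
        yield held
```

Under this reading a short trailing remainder is appended to the preceding page, which is one way in which a few pages can end up noticeably longer than B.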

The  pagination  of  the  2,055  Mb  TREC  collec- 
tion resulted  in  the  742,358  documents  being  con- 
verted into  1,743,848  pages  of  average  size  1,235 
bytes each.

1. Order the words in the query from highest
   weight to lowest.

2. Set $A \leftarrow \emptyset$; $A$ is the current set of
   accumulators.

3. For each word $t$ in the query,

   (a) Retrieve $I_t$, the inverted file entry for $t$.

   (b) For each $(d, f_{d,t})$ pair in $I_t$,

       i. If the accumulator $A_d \in A$, calculate
          $A_d \leftarrow A_d + w_{q,t} \cdot w_{d,t}$.

       ii. Otherwise, set $A \leftarrow A + \{A_d\}$, calculate
           $A_d \leftarrow w_{q,t} \cdot w_{d,t}$.

   (c) If $|A| > L$, go to step 4.

4. For each document $d$ such that $A_d \in A$,
   calculate $C_d \leftarrow A_d / W_d$.

5. Identify the $r$ highest values of $C_d$ using a
   heap.

2.2    Management  of  accumulators 

An effective way to rank documents against a
query is with the cosine measure, defined by the
function

$$\mathrm{cosine}(q, d) = \frac{\sum_t w_{q,t} \cdot w_{d,t}}{\sqrt{\sum_t w_{q,t}^2 \cdot \sum_t w_{d,t}^2}}$$

where $q$ is the query, $d$ is the document, and $w_{x,t}$
is the weight of word $t$ in document or query $x$.
We assume that $W_d = \sqrt{\sum_t w_{d,t}^2}$ is the length
of document $d$, and that the top $r$ answers are to
be retrieved and returned to the user.

In our experiments term weights were calculated
using the IDF rule $w_{d,t} = f_{d,t} \cdot \log(N/f_t)$,
where $f_{d,t}$ is the number of appearances of term
$t$ in document $d$, $N$ is the number of documents
in the collection, and $f_t$ is the number of
documents that contain term $t$. Harman [3] gives a
summary of ranking techniques and a discussion
of the cosine measure and IDF weighting rule.

To support the cosine measure in an inverted
file text database system, each inverted file entry
contains a sequence of $(d, f_{d,t})$ pairs for some term
$t$. The value $f_t$, the number of documents
containing $t$, is stored in the lexicon, and from these
two the cosine contribution $f_{d,t} \cdot (\log(N/f_t))^2$ of
term $t$ in document $d$ can be calculated. A
structure of accumulators is used to collect these
contributions. As the inverted file entry for each
query term is processed, the accumulator value
for each document number in the inverted file
entry is updated by adding in each partial cosine
value as it is calculated, so that by the time
all inverted file entries have been processed the
accumulator for document $d$ contains the value

$$\sum_t f_{d,t} \cdot (\log(N/f_t))^2 = \sum_t w_{q,t} \cdot w_{d,t}.$$
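
The accumulator scheme just described can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' C implementation: the index is an in-memory dictionary rather than a compressed inverted file, each query term is taken to occur once in the query (so that the query weight is simply the IDF), and the names `build_index` and `rank` are hypothetical.

```python
import heapq
import math
from collections import defaultdict


def build_index(docs):
    """Return (inverted_file, doc_lengths, N) for a list of token lists."""
    inverted = defaultdict(list)                 # term -> [(d, f_dt), ...]
    for d, tokens in enumerate(docs):
        counts = defaultdict(int)
        for tok in tokens:
            counts[tok] += 1
        for term, f_dt in counts.items():
            inverted[term].append((d, f_dt))
    n_docs = len(docs)
    squares = [0.0] * n_docs
    for postings in inverted.values():
        idf = math.log(n_docs / len(postings))   # log(N / f_t)
        for d, f_dt in postings:
            squares[d] += (f_dt * idf) ** 2      # w_dt squared
    lengths = [math.sqrt(x) for x in squares]    # W_d
    return inverted, lengths, n_docs


def rank(query_terms, inverted, lengths, n_docs, r):
    """Exhaustive cosine ranking: every matching document gets an accumulator."""
    acc = defaultdict(float)                     # A_d
    for t in set(query_terms):
        postings = inverted.get(t, [])
        if not postings:
            continue
        idf = math.log(n_docs / len(postings))
        for d, f_dt in postings:
            acc[d] += idf * (f_dt * idf)         # w_qt * w_dt with f_qt = 1
    scored = ((a / lengths[d], d) for d, a in acc.items() if lengths[d] > 0)
    return heapq.nlargest(r, scored)             # the r highest C_d values
```

The query length is omitted from the normalisation because it is the same for every document and so does not affect the ordering.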

The  number  of  accumulators  required  to  pro- 
cess a  query  can  be  large.  Queries  were  created 
from  TREC  topics  51-100  by  simply  extracting 
and  stemming  all  of  the  alphanumeric  strings, 
and  stopping  601  closed-class  words.  The  gen- 
erated queries  had  an  average  of  over  40  terms, 
and  each  resulted  in  about  75%  of  the  pages 
of  TREC  having  a  non-zero  accumulator  value. 
With  this  ratio  the  most  memory-efficient  accu- 
mulator structure  is  an  array.  But  at  32  or  64 
bits  per  accumulator,  this  structure  is  formidably 
large,  and  dominates  the  retrieval-time  mem- 
ory requirements.  Even  initialising  it  is  time- 
consuming. 


Figure 2: Quit algorithm for computing cosine
using approximately L accumulators

A  simple  strategy  for  restricting  the  number  of 
accumulators  is  to  order  query  terms  by  decreas- 
ing weight,  and  only  process  terms  until  some 
designated  termination  condition  is  met.  One 
such  condition  is  to  impose  an  a  priori  bound  L 
on  the  number  of  non-zero  accumulators,  and  to 
stop  processing  terms  when  this  bound  is  reached. 
Such  a  ranking  process,  which  we  call  quit,  is  il- 
lustrated in  Figure  2.  Other  possible  termination 
conditions  would  be  to  place  a  limit  on  the  num- 
ber of  terms  considered  or  on  the  total  number 
of  pointers  decoded;  or  to  place  an  upper  bound 
on  the  term  frequency  ft ,  and  only  process  terms 
that  appear  in  fewer  than  x%  of  the  documents, 
for  some  predetermined  value  x.  Buckley  and  Le- 
wit  [1];  Lucarella  [8];  and  Wong  and  Lee  [13]  have 
also  considered  various  stopping  rules,  and  derive 
conditions  under  which  inverted  file  entries  can  be 
discarded  without  affecting  the  r  top  documents 
(although  their  order  within  the  ranking  may  be 
different).  Here  we  are  prepared  to  allow  approx- 
imate as  well  as  exact  rules,  acknowledging  that 
the  cosine  measure  is  itself  a  heuristic. 

Quitting  has  the  advantage  of  only  process- 
ing the  inverted  file  entries  of  a  subset  of  the 
query  terms,  and  hence  of  faster  ranking,  but 
at  the  possible  expense  of  poor  retrieval  perfor- 
mance, depending  upon  how  discriminating  the 
low  weighted  terms  are. 
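
A minimal Python sketch of the quit strategy of Figure 2 follows. It assumes an in-memory index of the form `term -> [(d, f_dt), ...]`, precomputed document lengths, and query terms that each occur once in the query; the function name `quit_rank` and its parameters are illustrative only.

```python
import heapq
import math


def quit_rank(query_terms, inverted, lengths, n_docs, limit, r):
    """Stop processing query terms once more than `limit` accumulators exist."""
    def idf(t):
        return math.log(n_docs / len(inverted[t]))

    # Step 1: with f_qt = 1 the query weight equals the IDF, so sort by IDF.
    terms = sorted((t for t in set(query_terms) if t in inverted),
                   key=lambda t: (-idf(t), t))
    acc = {}                                      # Step 2: A is empty
    for t in terms:                               # Step 3
        w = idf(t)
        for d, f_dt in inverted[t]:
            acc[d] = acc.get(d, 0.0) + w * f_dt * w
        if len(acc) > limit:                      # Step 3(c), tested per term
            break                                 # remaining terms are ignored
    scored = ((a / lengths[d], d)                 # Step 4: C_d = A_d / W_d
              for d, a in acc.items() if lengths[d] > 0)
    return heapq.nlargest(r, scored)              # Step 5
```

Because the bound is tested only after a whole inverted file entry has been processed, the number of accumulators can exceed the limit, which is why the text speaks of approximately L accumulators.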

1. Order the words in the query from highest
   weight to lowest.

2. Set $A \leftarrow \emptyset$.

3. For each word $t$ in the query,

   (a) Retrieve $I_t$.

   (b) For each $(d, f_{d,t})$ pair in $I_t$,

       i. If $A_d \in A$, calculate
          $A_d \leftarrow A_d + w_{q,t} \cdot w_{d,t}$.

       ii. Otherwise, set $A \leftarrow A + \{A_d\}$, calculate
           $A_d \leftarrow w_{q,t} \cdot w_{d,t}$.

   (c) If $|A| > L$, go to step 4.

4. For each remaining word $t$ in the query,

   (a) Retrieve $I_t$.

   (b) For each $d$ such that $A_d \in A$,
       if $(d, f_{d,t}) \in I_t$, calculate
       $A_d \leftarrow A_d + w_{q,t} \cdot w_{d,t}$.

5. For each document $d$ such that $A_d \in A$,
   calculate $C_d \leftarrow A_d / W_d$.

6. Identify the $r$ highest values of $C_d$.


Figure 3: Continue algorithm for using
approximately L accumulators

An alternative is to continue the processing of
inverted file entries after the bound on the number
of accumulators is reached, but allow no new
documents into the accumulator set. This continue
algorithm is illustrated in Figure 3. The
algorithm has two distinct phases. In the first
phase, accumulators are added freely, as in the
quit algorithm. In the second phase, existing
accumulator values are updated but no new
accumulators are added. Both quit and continue
generate the same set of approximately L candidate
answers, but in a different permutation, so when
the top r documents are extracted from this set
and returned, different retrieval effectiveness can
be expected, particularly if $r \ll L$.
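
The two phases of continue can be sketched in the same style. The sketch assumes the same in-memory index layout and single-occurrence query terms as before; `continue_rank` is a hypothetical name. Phase two here scans each inverted file entry and tests membership in the accumulator set, which yields the same values as step 4(b) of Figure 3 although it iterates in the other direction.

```python
import heapq
import math


def continue_rank(query_terms, inverted, lengths, n_docs, limit, r):
    """Add accumulators freely until the bound is passed, then only update them."""
    def idf(t):
        return math.log(n_docs / len(inverted[t]))

    terms = sorted((t for t in set(query_terms) if t in inverted),
                   key=lambda t: (-idf(t), t))
    acc = {}
    adding = True                                 # True during phase one
    for t in terms:
        w = idf(t)
        for d, f_dt in inverted[t]:
            if d in acc:
                acc[d] += w * f_dt * w            # update an existing accumulator
            elif adding:
                acc[d] = w * f_dt * w             # new accumulator, phase one only
        if adding and len(acc) > limit:
            adding = False                        # switch to phase two
    scored = ((a / lengths[d], d)
              for d, a in acc.items() if lengths[d] > 0)
    return heapq.nlargest(r, scored)
```

On the same input the candidate set matches that of quit, but the low-weight terms still adjust the scores within it.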

A similar strategy to that of continue is
described by Harman and Candela [3, 4], although
their motivation is somewhat different: they
desire a small number of accumulators to reduce
the sorting time required for a total ranking of
the collection, whereas we assume that only a
small fraction of the collection is to be presented.
In this latter case, to find the top $r$ documents
a heap is the appropriate data structure, and
$O(N + r \log N)$ time is required, a small fraction
of query processing time.
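
The heap-based selection can be illustrated with Python's standard `heapq` module. This is a sketch of the general technique only: building the heap is linear in the number of candidates, and each of the r extractions costs logarithmic time. The function name `top_r` is hypothetical.

```python
import heapq


def top_r(scores, r):
    """Return the r largest (score, doc) pairs from a {doc: score} mapping."""
    heap = [(-c, d) for d, c in scores.items()]   # negate: heapq is a min-heap
    heapq.heapify(heap)                           # linear-time heap construction
    out = []
    for _ in range(min(r, len(heap))):            # r extractions, O(log N) each
        neg_c, d = heapq.heappop(heap)
        out.append((-neg_c, d))
    return out
```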

Figure 4 shows retrieval effectiveness, measured
as an 11pt average precision, as a function
of $k = |A|$, the number of accumulators actually
used. The small numbers beside the continue
curve show the average number of terms
processed in phase one of each query. For example,
only 8.2 terms are needed to generate 27,000
accumulators. The difference between quit and
continue is marked, and, perhaps surprisingly, even
the mid to low weight terms appear to contribute
to the effectiveness of the cosine rule. Ignoring
them leads to significantly poorer retrieval.

Figure 4: Effectiveness of quit and continue

In  these  experiments  direct  assessment  of  the 
relevance  of  pages  was  not  possible  because  the 
relevance  judgements  were  for  whole  documents. 
To  measure  effectiveness,  it  was  assumed  that  if  a 
document  was  relevant  then  each  page  from  it  was 
relevant,  but  only  the  most  highly  ranked  page 
from  each  document  was  returned  and  counted. 
We  believe  this  method  to  be  fair,  since  one  way 
in  which  pages  might  be  used  would  be  to  rank  on 
pages  but  return  whole  documents,  with  perhaps 
some  highlighting  of  the  top-ranked  pages.  Other 
strategies  for  using  parts  of  documents  to  select 
whole  documents  are  discussed  in  Section  3. 
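
The evaluation convention described above, in which only the most highly ranked page of each document is counted, can be sketched as follows. The function name `documents_from_pages` and the `page_to_doc` mapping are illustrative assumptions.

```python
def documents_from_pages(ranked_pages, page_to_doc):
    """Collapse a page ranking into a document ranking, keeping each best page."""
    seen = set()
    out = []
    for page in ranked_pages:            # pages arrive best first
        doc = page_to_doc[page]
        if doc not in seen:              # later pages of the same document are dropped
            seen.add(doc)
            out.append((doc, page))
    return out
```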

Also  surprising  is  that  the  continue  strategy, 
with  restricted  numbers  of  accumulators,  is  capa- 
ble of  better  retrieval  performance  than  the  orig- 
inal cosine  method,  in  which  all  pages  are  per- 
mitted accumulators.  It  would  appear  that  the 
mid  to  low  weight  terms,  while  contributing  to 
retrieval  effectiveness,  should  not  be  permitted 
to  collaborate  and  select  documents  that  contain 
none of the more highly weighted terms.

Figure 5: CPU time of quit and continue

Whilst  the  results  for  retrieval  effectiveness  are 
encouraging,  the  results  for  time  are  not.  Figure  5 
shows  the  cost  of  these  two  strategies  in  terms  of 
cpu  time,  with  the  set  A  of  accumulators  stored 
as  a  hash  table.  The  times  shown  include  all 
processing  required  to  identify  the  top  r  =  200 
ranked  documents,  but  do  not  include  the  cost 
of  actually  retrieving  those  documents,  which,  in 
a  system  such  as  ours,  in  which  the  text  is  also 
compressed,  involves  further  decoding  effort.  All 
timings  are  for  a  lightly  loaded  Sun  SPARCsta- 
tion  Model  512  using  a  single  processor;  programs 
were  written  in  C  and  compiled  using  gcc.  While 
for  values  of  L  less  than  about  100,000  the  con- 
tinue method  takes  no  longer  than  the  simple  co- 
sine measure  when  implemented  using  an  array 
of  accumulators,  it  is  clearly  unsatisfactory  com- 
pared with  the  quit  technique.  In  the  next  section 
we  discuss  methods  by  which  this  situation  can 
be  improved. 

2.3    Processing  long  inverted  file  entries 

Compression can reduce index size by a factor
of six; for example, for the paged form of TREC
the size reduces from 1,120 Mb to 184 Mb, an
irresistible saving. Compression, however,
prohibits random access into inverted file entries, so
that the whole of each inverted file entry must
be decoded, even though not every $(d, f_{d,t})$ pair
is required. This is the reason that the continue
strategy is slow.

The need to decompress an inverted file entry
in full can be avoided by including a series
of synchronisation points at which decoding can
commence [10]. These can be arranged as a series
of pointers, or skips, as illustrated in Figure 6.

Figure 6: Adding skips (diagram labels: skip;
document numbers; inverted file entry)
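
The idea behind Figure 6 can be sketched in Python. The sketch is deliberately simplified and its details are assumptions: plain lists stand in for the compressed blocks, a binary search stands in for sequential decoding of the skips, and the names `build_skipped` and `lookup` are hypothetical.

```python
from bisect import bisect_right


def build_skipped(postings, block_size):
    """Split a sorted [(d, f_dt), ...] list into blocks with one skip per block."""
    blocks = [postings[i:i + block_size]
              for i in range(0, len(postings), block_size)]
    skips = [block[0][0] for block in blocks]     # first document number per block
    return skips, blocks


def lookup(skips, blocks, d):
    """Return f_dt for document d, or None, examining a single block."""
    i = bisect_right(skips, d) - 1                # last block starting at or before d
    if i < 0:
        return None
    for doc, f_dt in blocks[i]:                   # stands in for decoding one block
        if doc == d:
            return f_dt
        if doc > d:
            break
    return None
```

Only the skips and one block are touched per lookup, which is the property exploited in the second phase of continue.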

The skips divide the inverted file entry into a
series of blocks, and to access a number in the
entry, only the skips and the block containing
the number need to be decoded. Appropriately
coded, such skips increase the size of the
compressed inverted file by only a few percent, but
can drastically reduce the amount of decoding
required. Decode time for a skipped inverted file
entry is given by $T = t_d (2s + Lp/(2s))$, where $t_d$
is the time needed to decode one $(d, f_{d,t})$ pair, $p$
is the number of such pairs in the inverted file
entry, $L$ is the number of accumulators, and $s$
is the number of skips. This time is minimised
at $s = \sqrt{Lp}/2$. For 10,000 accumulators and
an inverted file entry of length 100,000 (a common
figure for queries to the paged TREC), total
time including reading from disk on a typical
system drops from 0.30 seconds to 0.22 seconds
[10]. Moreover, under the same assumptions
an uncompressed inverted file entry of this
length would take 0.30 seconds to read, and so the
skipped compressed inverted file provides faster
ranking than even an uncompressed inverted file.
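
The minimising value of s follows from setting the derivative of the decode-time expression to zero: $dT/ds = t_d (2 - Lp/(2s^2)) = 0$ gives $s^2 = Lp/4$. The short Python check below evaluates the model for the example in the text, with costs expressed in units of $t_d$. It covers decoding only and does not model disk transfer, so it is not expected to reproduce the quoted timings; the function names are hypothetical.

```python
import math


def decode_cost(s, L, p):
    """Number of pair-decodes predicted for s skips: 2s + L*p / (2s)."""
    return 2 * s + L * p / (2 * s)


def best_skips(L, p):
    """Value of s that minimises decode_cost, namely sqrt(L*p) / 2."""
    return math.sqrt(L * p) / 2


L, p = 10_000, 100_000
s = best_skips(L, p)
print(f"skips: {s:.0f}, decodes with skips: {decode_cost(s, L, p):.0f}, without: {p}")
```

For these values the model predicts roughly 15,800 skips and about 63,000 pair-decodes, against 100,000 when the whole entry is decoded.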

To test skipping, inverted files were
constructed for several different fixed values of L and
then, for each inverted file, the cpu time for
inverted file processing was measured for a range of
actual L values. To ensure that inverted files did
not become too large, a minimum block size was
imposed, requiring that every skip cover at least
four $(d, f_{d,t})$ pairs. The results of these
experiments are shown in Figure 7. As can be seen,
substantially faster query processing is the result
when the number of accumulators is less than
about 100,000. As predicted by the analysis, the
index constructed assuming that L = 1,000 gives
the best performance when the number of
accumulators (the variable L in Figure 3) is small.

Figure 7: Time required by skipping

2.4    Retrieval  experiments 

The  implementation  techniques  described  above, 
skipping  and  limitation  of  the  number  of  accu- 
mulators, can  be  applied  to  document  retrieval 
as  well  as  to  page  retrieval.  Table  1  summarises 
the  various  resources  required  and  performance 
achieved  by  both  document  and  page  retrieval, 
for  various  values  of  L,  the  nominal  number  of  ac- 
cumulators. The  following  points  should  be  noted 
in  connection  with  the  data  in  Table  1. 

File sizes  Stemming was applied during
inverted file construction, but no words were
stopped. All alphanumeric strings were
indexed, totalling 333,000,000 term
occurrences of 538,244 distinct terms. The
document index contained 136,000,000 $(d, f_{d,t})$
pointers, the paged index 196,000,000 such
pairs. Each is compressed to occupy less
than one byte. The variation in inverted file
size is due to the insertion of skips, or, in the
case of the "All" row, their absence.

The  text  itself  was  also  stored  compressed 
using  a  word-based  model  [9],  allowing  the 
2,055  Mb  to  be  reduced  to  605  Mb.  A  fur- 
ther 27  Mb  of  smaller  auxiliary  files  were 
also  required.  In  total,  the  complete  retrieval 
system  for  the  paged  TREC  occupied  about 
817  Mb,  or  40%  of  the  initial  unindexed  text. 

Retrieval time  The SPARCstation 512 used
for these experiments is approximately 2.2
times faster than the SPARCstation 2 used
for the first round of experimentation [6, 9].
The times listed cover all activity from issue
of query until a ranked list of 200 document
numbers is calculated.

Fetching  of  the  200  answers,  which  occupy 
226.9  Kb  compressed,  adds  a  uniform  1.9  sec 
per  query  including  the  cost  of  decompres- 
sion. Decompressed,  there  is  an  average  of 
829.9  Kb  of  output  text  per  query. 

Elapsed times during the ranking process
are generally about 1.6 sec greater than cpu
time, in the course of which an average of
42 disk accesses are made to the lexicon; 42
disk accesses are made to the inverted file itself
(fetching a total of 1.65 Mb of data,
containing about two million compressed $(d, f_{d,t})$
pairs); and 275 accesses are made to the combined file
of document weights and addresses. All of
these files except the inverted file are small,
and likely to have been buffered into main
memory, hence the small overhead.

Elapsed  time  for  the  presentation  of  docu- 
ments was  about  5.0  sec  per  query  greater 
than  cpu  time,  caused  by  the  need  to  per- 
form a  further  200  seeks  into  the  file  con- 
taining the  compressed  text.  This  file  is  too 
large  for  there  to  have  been  any  buffering  ef- 
fect. 

Memory  space  When  skipping  is  employed 
(the  first  three  rows  in  each  section  of  the  ta- 
ble) memory  space  during  query  processing 
is  proportional  to  the  number  of  non-zero  ac- 
cumulators. In  these  experiments  a  hash  ta- 
ble was  used,  and  an  average  of  14  bytes  per 
accumulator  required.  For  the  "All"  exper- 
iments an  array  of  accumulators  was  used, 
2.8  Mb  for  document  retrieval,  and  6.6  Mb 
for  page  retrieval. 

An  array  of  6-bit  approximate  document 
lengths  was  used  to  guide  the  retrieval  pro- 
cess [11],  requiring  0.5  Mb  for  document  re- 
trieval and  1.2  Mb  for  page  retrieval. 

Terms processed  The values for "Terms
processed" indicate the average number of terms
processed before processing switched from
the first to the second phase of continue. For
each query a whole number of inverted file
entries were processed, and this is why the
"non-zero accumulators" average is greater
than the target value L. Not surprisingly,
with page retrieval fewer terms were
required, on average, to consume the available
accumulators.

              Inverted file       Retrieval time      Non-zero
              (Mb)                (cpu sec)           accumulators
              Doc.     Page       Doc.     Page       Doc.       Page
L = 1,000     145.8    205.4      2.36     2.95       2,330      2,701
L = 10,000    156.7    220.3      4.46     5.57       12,855     13,238
L = 100,000   161.6    230.2      9.80     12.00      107,737    109,370
All (array)   128.6    184.4      7.66     12.48      582,783    1,307,115

(a) Resources required

              Terms               Precision at 200              Pessimal 11pt
              processed                                         average
              Doc.     Page       Doc.          Page            Doc.     Page
L = 1,000     3.08     2.84       0.271-0.612   0.260-0.639     0.164    0.166
L = 10,000    6.80     6.06       0.346-0.530   0.319-0.554     0.185    0.173
L = 100,000   16.84    14.26      0.333-0.446   0.326-0.504     0.175    0.175
All (array)   42.44    42.44      0.331-0.444   0.321-0.502     0.174    0.173

(b) Retrieval effectiveness


Table 1: Document vs. paragraph retrieval: (a) resources required, and (b) retrieval effectiveness

Precision at 200  The range of values in this
column is to show the fraction of unjudged
documents that were accessed. The lower
number assumes that all unjudged
documents are irrelevant; the upper is calculated
assuming that all are relevant.

11pt effectiveness  The "11pt average" values
are calculated based solely upon the top 200
ranked documents, assuming that all
remaining relevant documents are ranked last, and
that all unjudged documents are not
relevant.

Index  construction  New  algorithms  have  been 
developed  for  index  construction.  Using 
these  algorithms,  the  index  was  built  in  un- 
der 4  hours,  using  a  peak  of  40  Mb  of  main 
memory  and  less  than  50  Mb  of  temporary 
disk  space  above  and  beyond  the  final  size 
of  the  inverted  file.  The  compression  of  the 
documents  took  a  further  4  hours,  for  a  total 
database  build  time  of  under  8  hours. 
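
The memory figures quoted under "Memory space" above can be checked with a little arithmetic. The check assumes 32-bit array accumulators (the text mentions 32 or 64 bits) and reads "Mb" as $2^{20}$ bytes; both are assumptions of this sketch, as is the helper name `array_mb`. The document and page counts are those given in Section 2.1.

```python
MB = 2 ** 20   # assumption: "Mb" in the text denotes 2**20 bytes


def array_mb(n_items, bits_per_item):
    """Size in Mb of a packed array with one fixed-width entry per item."""
    return n_items * bits_per_item / 8 / MB


documents, pages = 742_358, 1_743_848

print(f"accumulator arrays: {array_mb(documents, 32):.2f} and {array_mb(pages, 32):.2f} Mb")
print(f"6-bit length arrays: {array_mb(documents, 6):.2f} and {array_mb(pages, 6):.2f} Mb")
# Hash table at 14 bytes per accumulator, page retrieval with L = 10,000 (Table 1).
print(f"hash table: {13_238 * 14 / MB:.2f} Mb")
```

Under these assumptions the array sizes come out close to the quoted 2.8 Mb and 6.6 Mb, and the length arrays close to 0.5 Mb and 1.2 Mb, while the hash table for L = 10,000 needs well under 1 Mb.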

Based  upon  these  experiments  we  conclude  that: 

•  use  of  a  limited  number  of  accumulators  does 
not  appear  to  impact  retrieval  effectiveness, 
and  so  is  an  extremely  attractive  heuristic  for 
ranking  large  collections  because  of  the  dra- 
matic savings  in  retrieval-time  memory  us- 
age that  result; 


•  introduction  of  skips  to  the  compressed  in- 
verted file  entries  significantly  reduces  pro- 
cessing time  in  this  restricted-accumulators 
environment; 

•  index  compression  can,  in  this  way,  become 
"free":  if  only  partial  decoding  is  required, 
the  input  time  saved  by  compression  can  be 
more  than  enough  to  pay  the  cpu  cost  of  frac- 
tional decoding; 

•  all non-stopped terms should be allowed to
contribute to the ranking, and the quit
strategy is inferior to continue; and

•  even relatively simple pagination gives
retrieval not measurably inferior to document
retrieval.

When  dealing  with  pages,  the  retrieval  system 
was  implemented  to  display  the  entire  document, 
but  highlight  the  page  or  pages  that  had  caused 
the  document  to  be  ranked  highly.  This  deci- 
sion made  the  paged  collection  particularly  easy 
to  use,  and  when  we  browsed  the  answers  to 
the  queries  it  was  very  satisfying  to  be  able  to 
find  the  exact  text  that  had  triggered  the  match. 
This  advantage  of  pagination  became  clear  even 
in  the  early  stages  of  the  project,  while  we  were 
attempting  to  debug  the  implementation  of  the 
cosine  method. 

On the other hand, a problem that was
brought out by these experiments was the
difficulty of comparing retrieval effectiveness in
the face of non-exhaustive relevance judgements.
When precision rates are around 30%, and a
further (in the L = 1,000 case) 30% of documents
are unjudged, there can be no significance
whatsoever attached to the difference between even
30% precision and 40% precision. Indeed,
assuming that 27.1% of the unjudged documents
are relevant for the "L = 1,000; Doc"
combination gives a final precision of 0.411; the
corresponding number for the "All; Doc" pairing is
only 0.373. Thus, the precision figures of Table 1
are sufficiently imprecise that no conclusion can
be drawn about the appropriate value of L that
should be used, or about the merits of
document vs. paged retrieval. There is clearly scope
for research into other methodologies for
comparing retrieval mechanisms.
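
The two extrapolated figures quoted above can be reproduced from the "Precision at 200" ranges of Table 1 under one reading, which is our assumption since the text does not spell out the calculation: the unjudged documents are taken to be relevant at the same rate $p$ as the answer set as a whole, so that $p = \mathit{lower} + p \cdot (\mathit{upper} - \mathit{lower})$. The function name is hypothetical.

```python
def extrapolated_precision(lower, upper):
    """Solve p = lower + p * unjudged, where unjudged = upper - lower."""
    unjudged = upper - lower              # fraction of the top 200 left unjudged
    return lower / (1.0 - unjudged)


print(round(extrapolated_precision(0.271, 0.612), 3))   # "L = 1,000; Doc" -> 0.411
print(round(extrapolated_precision(0.331, 0.444), 3))   # "All; Doc"       -> 0.373
```

The ordering of the two configurations reverses between the lower bounds and these extrapolated values, which illustrates why the ranges cannot separate them.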

3    Structured  documents 

Many  of  the  documents  in  the  TREC  collection 
are  very  large  and  have  explicit  structure,  and 
it  may  be  possible  to  use  this  structure — rather 
than  the  statistically  based  pagination  methods 
described  above — to  break  documents  into  parts. 
In  particular,  many  documents  can  be  broken  up 
into  a  set  of  sections,  each  section  having  a  type. 
There  has  been  relatively  little  work  done  on  re- 
trieving or  ranking  partial  documents.  However, 
Salton  et  al.  [12]  have  demonstrated  that  docu- 
ment structure  can  be  valuable.  Sometimes  this 
structure  is  explicitly  available  [2],  and  sometimes 
it  has  to  be  discovered  [5],  but  the  knowledge 
of  this  structure  has  been  shown  to  help  deter- 
mine the  relevance  of  sub-documents.  In  this 
part  of  the  work  we  used  a  small  database  to 
investigate  whether  retrieval  of  sections  helped 
document  retrieval,  and  whether  retrieval  of  doc- 
uments helped  section  retrieval.  By  way  of  a 
benchmark,  the  paged  retrieval  techniques  de- 
scribed earlier  were  applied  to  the  same  database. 

3.1    The  database 

Since we needed information about the relevance
of sections to queries it was not possible to use the
full TREC database. Instead, we used a database
consisting of 4,000 documents extracted from the
Federal Register collection. These documents
were selected as being the 2,000 largest
documents which were relevant to at least one of topics
51-100 provided for the first TREC experiment.
Another 2,000 documents were randomly selected
from the Federal Register collection to provide
both smaller documents and non-relevant
documents. The average number of words in these
documents was 3,260.

These documents were then split into sections
based on their internal markup. The documents
had a number of tags inserted that defined an
internal structure. It appeared that only the T2
and T3 tags could be reliably used to indicate a
new internal fragment. Section breaks were
defined to be a blank line, or a line containing only
markup, followed by a T2 or a T3 tag. This led
to a database of 32,737 sections. Each of these
sections had a type based on its tag. The types
were (purpose), (abstract), (start), (summary),
(title), (supplementary), and a general category
(misc) that included all remaining categories.

Having made the document selections, only 19
of the queries 51-100 had a relevant document
in the collection. Each of the sections for
documents that had been judged as relevant was
judged for relevance against these queries so that
finer grained retrieval experiments were
possible. One difficulty that arose was that quite a
few documents that had been judged relevant
appeared to have no relevant sections: there were
relevant key terms in the documents but the
documents themselves did not appear to address the
information requirement. There were 145 such
(query, document) pairs. To be consistent, we
took these documents to be irrelevant. After these
alterations, only 14 queries had a relevant section,
and there were an average of 23 relevant sections
per query.

3.2    Structured ranking

We  carried  out  a  set  of  experiments  on  rank- 
ing documents  using  the  retrieval  of  sections. 
We  first  compared  simple  ranking  of  documents 
against  ranking  sections  to  find  relevant  docu- 
ments. Next,  a  set  of  formulae  were  devised  that 
attempted  to  use  the  fact  that  one  document  has 
several  sections  that  might  be  more  or  less  highly 
ranked.  These  took  into  consideration  the  rank 
of  the  section,  the  number  of  ranked  sections,  and 
the  number  of  sections  in  the  document.  Exper- 
iment 3  describes  one  of  the  more  successful  for- 
mulas. 

Further trials were then performed using the
type of the section. First, a set of experiments
was run to determine which types were
better predictors of relevance. These results were
then used to devise a measure that used a weight
for each type. Finally, we tried to combine these
results to see if it was helpful to use the rank of
the whole document along with the rank of its
component parts.

Experiment  1:  Rank  full  documents  against 
the  queries  using  standard  cosine  measure. 

Experiment  2:  Split  documents  into  sections. 
Measure  similarity  of  each  section  against 
the  queries  using  standard  cosine  mea- 
sure. Order  documents  based  on  the  highest 
ranked  section. 

Experiment 3: Split documents into sections.
Measure similarity of each section against
the queries using standard cosine measure.
Order documents based on

$$\sum_n 0.5^{n-1} \cdot wt(s_n)$$

where $s_n$ is the $n$-th highest ranked section in
the document. The effect of this formula is
that a document's weight is determined by
a decay formula using the weight of all of a
document's sections. (Values other than 0.5
were tried, but gave poorer performance.)

Experiment 4: Split documents into sections.
Measure similarity of each section against
the queries using standard cosine measure.
Then weight each section using its type, for
example (introduction) or (address). Order
documents based on

$$\sum_n tw(type(s_n)) \cdot 0.5^{n-1} \cdot wt(s_n)$$

where $s_n$ is the $n$-th highest ranked section in
the document, $type(s_n)$ is the type of the
section, and $tw(t_i)$ is the weight of the type $t_i$.

The weights of the types of sections were
obtained by conducting a set of experiments
where all types but one were given a weight
of 1, and in turn, each type was given a
weight of 2. Using these experiments it was
determined that (purpose) and (summary)
were each more helpful, and that (misc) was
less helpful. As a result, in this
experiment tw((purpose)) = tw((summary)) = 2,
tw((misc)) = 0.5, and other weights were set
to 1.

Experiment  5:  Rank  the  documents,  and  rank 
the  sections.  Form  a  new  rank  based  on  the 
average  rank  of  these  two  ranks. 
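
The decay formulas of Experiments 3 and 4 can be sketched in Python. In this sketch the sections of a document are ordered by their cosine weight before the decay is applied, and types without a listed weight default to 1; the first of these is an assumption where the text is not explicit, and the function name `document_score` is hypothetical.

```python
def document_score(sections, decay=0.5, type_weight=None):
    """Sum over n of tw(type(s_n)) * decay**(n-1) * wt(s_n).

    `sections` is a list of (type, cosine_weight) pairs for one document.
    """
    type_weight = type_weight or {}
    ranked = sorted(sections, key=lambda s: s[1], reverse=True)
    # enumerate() starts at 0, so decay**n equals 0.5**(n-1) for 1-based n.
    return sum(type_weight.get(kind, 1.0) * decay ** n * wt
               for n, (kind, wt) in enumerate(ranked))


sections = [("misc", 0.4), ("summary", 0.8), ("title", 0.2)]
experiment_3 = document_score(sections)
experiment_4 = document_score(
    sections, type_weight={"purpose": 2.0, "summary": 2.0, "misc": 0.5})
```

With no type weights the function reduces to the Experiment 3 formula, so one routine covers both cases.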


Other  experiments  were  carried  out  using  best 
two  sections,  and  formulas  that  more  closely  ap- 
proximated the  cosine  measure.  None  of  these 
experiments  achieved  better  results  than  the  ones 
displayed  here.  The  obvious  conclusion  is  that  if 
documents  are  available  for  ranking  as  whole  doc- 
uments, then  for  this  collection  it  is  preferable  to 
do  so. 

3.3  Section  retrieval 

For  very  long  documents  it  may  be  desirable  to 
return  relevant  sections  rather  than  relevant  doc- 
uments. We  were  interested  to  see  whether  it 
might  be  useful  to  know  about  the  rank  of  the 
containing  document.  In  the  first  experiment 
documents  were  ranked,  and  sections  shown  in 
document  order.  This  produced  very  poor  re- 
sults. Next,  we  still  ranked  sections  in  higher 
ranked  documents  ahead  of  lower  ranked  docu- 
ments, but  used  section  ranking  for  sections  in 
the  same  document.  This  was  reasonable  but 
there  were  still  many  irrelevant  sections  being  ex- 
amined. Finally,  we  attempted  to  delete  these  ir- 
relevant sections  by  using  document  ranking,  and 
then  section  ranking,  but  this  time  discarding  sec- 
tions that  had  a  section  rank  of  greater  than  200. 

Experiment  6:  Rank  sections  against  the 
queries  using  standard  cosine  measure. 

Experiment  7:  Rank  full  documents  against 
the  queries  using  standard  cosine  measure. 
Order  the  sections  by  their  appearance 
within  documents. 

Experiment  8:  Rank  full  documents  against 
the  queries  using  standard  cosine  measure. 
Order  sections,  first  by  document,  then  by 
rank  within  documents. 

Experiment 9: As in experiment 8, but then
delete  all  but  the  200  highest  ranked  sec- 
tions. 
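
The ordering described in the prose above (sections sorted first by the rank of the containing document, then by their own rank, optionally discarding sections whose section rank exceeds a cutoff) can be sketched as below. The dictionary-based inputs and the function name are hypothetical; only the ordering rule and the cutoff of 200 come from the text.

```python
def order_sections(doc_rank, section_rank, section_doc, cutoff=None):
    """Order sections by containing-document rank, then by section rank.

    doc_rank: {doc_id: rank}, section_rank: {section_id: rank},
    section_doc: {section_id: doc_id}. Rank 1 is best.
    cutoff: if given, sections with a section rank above it are discarded.
    """
    kept = [s for s in section_rank
            if cutoff is None or section_rank[s] <= cutoff]
    return sorted(kept, key=lambda s: (doc_rank[section_doc[s]], section_rank[s]))
```

Calling it with `cutoff=200` corresponds to the final variant, in which low-ranked sections of highly ranked documents no longer crowd out better sections from other documents.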

These experiments show that ranking both
documents and sections does help to find more
relevant sections. This result is in contrast to
the earlier investigation of finding relevant documents.
Relevant sections are found much more
easily when the ranks of both the sections and
their containing documents are taken into account.

3.4  Paged  versus  section  retrieval 

Our  results  showing  that  retrieving  documents 
based  on  section  ranking  was  not  as  useful  as 
              Documents returned
Experiment    5      10     15     20     25     30     50     200
1             0.286  0.271  0.248  0.236  0.234  0.229  0.206  0.102
2             0.243  0.221  0.214  0.204  0.191  0.178  0.170  0.092
3             0.271  0.250  0.229  0.221  0.206  0.202  0.184  0.094
4             0.329  0.257  0.233  0.229  0.206  0.202  0.184  0.085
5             0.343  0.264  0.238  0.236  0.234  0.231  0.209  0.099

Table 2: Comparison of ranking formula for fixed number of documents returned


              Sections returned
Experiment    5      10     15     20     25     30     50     200
6             0.186  0.164  0.181  0.161  0.160  0.152  0.140  0.120
7             0.100  0.121  0.105  0.100  0.094  0.088  0.090  0.083
8             0.171  0.121  0.114  0.100  0.091  0.083  0.100  0.082
9             0.214  0.171  0.186  0.189  0.189  0.193  0.179  NA

Table 3: Comparison of ranking formula for fixed number of sections returned


document  ranking  was  perhaps  surprising — other 
studies  have  indicated  that  considering  smaller 
fragments  was  helpful,  although  in  combination 
with  larger  contexts.  We  wondered  whether  the 
section  boundaries  that  had  been  imposed  were 
inappropriate.  We  thus  applied  the  pagination 
techniques  described  in  Section  2.1  to  the  4,000 
document  database  used  here.  These  results  are 
shown  in  Table  4  as  Experiment  10. 

These  three  experiments  are  a  little  difficult 
to  interpret.  Dividing  documents  into  sections 
leads  to  poorer  retrieval  performance,  but  divid- 
ing further  into  pages  leads  to  comparable  re- 
trieval performance  to  ranking  whole  documents. 
It  may  be  that  the  manually  supplied  divisions 
are  poorer  than  the  divisions  generated  by  au- 
tomatic techniques  [5].  Experiments  2-4  show 
that  some  of  this  performance  degradation  can 
be  ameliorated  by  taking  documents'  structure 
into  account.  However,  these  experiments  indi- 
cate that  there  is  no  retrieval  advantage  in  break- 
ing the  document  up,  should  the  desired  unit  of 
retrieval  be  whole  documents. 

4  Conclusions 

The  combination  of  a  restricted-accumulators 
policy  and  the  introduction  of  skips  to  the  com- 
pressed inverted  file  entries  allows  fast  query 
evaluation  on  large  text  collections.  Moreover, 
the  ranking  can  be  carried  out  within  modest 
amounts of main memory. For example, the
paged TREC collection contains 1.7 million pages,
but  ranked  queries  of  50  or  more  terms  can  be  re- 
solved within  seconds  using  just  a  few  megabytes 
of  main  memory.  These  two  techniques  mean 
that  large  collections  can  be  searched  on  small 
machines  without  measurable  degradation  in  re- 
trieval effectiveness. 

In  the  second  part  of  the  experiment  we  have 
concentrated  on  large  documents,  breaking  them 
into  smaller  units  for  the  purposes  of  indexing. 
It  is  not  clear  that  users  are  interested  in  re- 
trieving 3  Mb  documents,  and  these  experiments 
were  designed  to  allow  users  the  option  of  retriev- 
ing smaller  parts  of  such  documents.  The  results 
were  mixed.  It  appears  that  indexing  both  sec- 
tions and  documents  is  helpful  in  ranking  sec- 
tions. However,  it  is  not  clear  what  an  appropri- 
ate indexing  strategy  is  if  only  full  documents  are 
to  be  returned.  We  are  continuing  our  investiga- 
tion of  partial  document  retrieval. 
Abstract

This  report  describes  a  few  experiments 
aimed  at  producing  high  accuracy  routing  and  re- 
trieval with  a  simple  Boolean  engine.  There  are 
several  motivations  for  this  work,  including:  (1) 
using  Boolean  term  combinations  as  a  filter  for 
advanced  data  extraction  systems,  (2)  improving 
"legacy"  Boolean  retrieval  systems  by  helping  to 
automate  the  generation  of  Boolean  queries,  and 
(3)  focusing  on  query  content,  rather  than  re- 
trieval or  ranking,  as  the  key  to  system  perfor- 
mance. The  results  show  very  high  accuracy, 
and  significant  progress,  using  a  Boolean  engine 
for routing based on queries that are manually
generated  with  the  help  of  corpus  data.  In  ad- 
dition, the  results  of  a  straightforward  imple- 
mentation of  a  fully  automatic  ad  hoc  method 
show  some  promise  of  being  able  to  do  good  au- 
tomatic query  construction  within  the  context  of 
a  Boolean  system. 

1  Introduction 

Full-text  search  is  currently  the  simplest  and  most 
commonly-used  method  for  locating  information  in  large 
volumes  of  free  text.  Because  users  are  accustomed  to 
describing  what  they  are  looking  for  with  specific  words, 
and  those  words  are  often  found  in  the  texts,  searching  the 
text  for  selected  words  or  word  combinations  is  a  natural 
and  easy-to-implement  method  for  information  retrieval. 
However,  it  can  be  very  inaccurate.  It  can  be  especially 
difficult  for  searchers  to  compose  "queries"  that  combine 
the  words  that  are  effective  in  locating  relevant  material 
without  finding  large  quantities  of  irrelevant  information 
as well. One way to cope with this difficulty, while still
preserving  the  advantages  of  the  full-text  search  engine, 

*This research was sponsored in part by the Advanced Research
Projects Agency. The views and conclusions contained in this document
are those of the authors and should not be interpreted as
representing the official policies, either expressed or implied, of the
Advanced Research Projects Agency or the US Government.


is  to  help  to  automate  the  process  of  generating  Boolean 
queries.  This  was  the  focus  of  GE's  TREC-2  effort. 

GE's  involvement  in  TREC  represents  a  relatively  low 
level  of  effort  aimed  at  bringing  together  natural  language 
text  processing,  data  extraction,  and  statistical  corpus 
analysis  methods.  Our  project  uses  innovative  approaches 
for  extracting  information  from  text,  best  exemplified  in 
our  results  in  the  MUC  and  TIPSTER  extraction  evalua- 
tions [7,  3]  and  in  operational  text  management  systems 
in  GE.  In  TREC-1,  we  attempted  to  show  the  benefit 
of  natural  language  interpretation  by  using  Boolean  ap- 
proximation to  select  portions  of  text  that  could  be  fur- 
ther interpreted.  The  main  result  of  this  was  that  natural 
language  seems  to  have  very  little  to  offer  as  a  precision 
filtering  method,  because  routing  and  retrieval  problems 
stem  largely  from  having  the  wrong  terms  in  the  queries 
[6].  Thus,  in  TREC-2,  we  have  stuck  with  the  Boolean 
engine,  concentrating  on  the  use  of  corpus  analysis  to  im- 
prove the  queries. 

Figure  1  summarizes  our  TREC  results.  Our  results  in 
TREC-2,  as  in  TREC-1,  were  quite  good  relative  to  other 
systems.  The  manual  routing  system,  which  comprised 
over 99% of our effort, produced an 11-point average of
.3308,  with  an  average  of  45  relevant  documents  in  the  top 
100.  This  put  GE's  system  at  the  very  top  of  the  man- 
ual routing  category  (the  system  with  the  best  11-point 
average  in  this  category  was  slightly  higher  on  the  11- 
point  average  and  had  slightly  fewer  relevant  documents, 
on  average,  in  the  top  100). 

The residual effort went into a fully automatic ad hoc
method,  which  produced  an  11  point  average  of  .2183  and 
an  average  of  37  relevant  documents  in  the  top  100.  As  in 
TREC-1,  performance  varied  dramatically  by  topic.  The 
routing  system  showed  the  best  results  (in  terms  of  preci- 
sion at  100  documents)  on  8  of  50  topics.  Yet  it  was  below 
median  on  17  topics.  This  not  only  suggests  areas  for  fur- 
ther improvement,  but  also  shows  an  important  difference 
between  the  Boolean  approach  and  some  of  the  statistical 
retrieval  systems.  The  Boolean  approach  does  much  bet- 
ter on  certain  topics,  but  the  statistical  approaches  have 
more  consistent  performance. 
                       AD HOC TEST            ROUTING TEST
                       11-pt.   Rel. @        11-pt.   Rel. @
                       avg.     100 docs.     avg.     100 docs.
Boolean '92            .2029    47.2          .2078    35.6
Pattern matcher '92    .1961    46.2          .1851    34.6
Avg. median run '92    .1585    39.7          .1246    28.6
Boolean '93            .2183*   37*           .3308    45
Avg. median run '93    .2620    41            .2910    41

* fully automatic

Figure 1: Summary GE Results on TREC-1 and TREC-2


While  it  is  very  hard  to  measure  progress  between 
TREC-1  and  TREC-2  because  none  of  the  numbers  are 
directly  comparable,  the  difference  between  our  .2078 
routing  average  in  TREC-1  and  .3308  in  TREC-2  is  large 
enough  that  we  are  quite  pleased  with  the  rate  of  progress 
and  confident  that  we  are  nowhere  near  the  peak  that  can 
be  achieved  with  manual,  Boolean  routing.  In  addition, 
the  automatic  ad  hoc  method  that  we  tried,  while  show- 
ing terrible  performance  on  some  of  the  more  convoluted 
topic  descriptions,  had  a  better  average  than  our  manual 
ad  hoc  system  last  year. 

Thus,  there  is  a  great  deal  to  be  gained  using  cor- 
pus analysis  to  automate  or  assist  in  query  generation, 
even  within  the  context  of  straightforward  retrieval  meth- 
ods. In  addition  to  having  promise  for  "legacy"  oper- 
ational systems,  these  results  suggest  that  natural  lan- 
guage methods,  focused  on  corpus  analysis  and  query  gen- 
eration, probably  can  help  in  improving  the  performance 
of  many  information  retrieval  systems. 


2    Boolean  Approximation 

The  basic  approach  in  our  system  is  to  compile  queries 
into  Boolean  tables  that  can  be  matched  at  high  speed 
against  a  stream  of  input  text.  This  approach  is  meant 
for  routing,  and  also  to  be  compatible  with  "downstream" 
analysis  such  as  what  we  do  in  TIPSTER  data  extraction. 
In  fact,  the  Boolean  compiler  we  use  is  designed  for  han- 
dling the  much  more  complex  expressions  that  our  system 
uses  in  data  extraction. 

Figure  2  illustrates  the  approach.  We  call  this  Boolean 
approximation because the Boolean expressions used in
the  basic  matching  engine  are  an  approximation  to  more 


detailed  processing  of  texts,  in  the  sense  that  they  are 
guaranteed  to  admit  all  text  that  would  be  admitted 
by  more  detailed  processing,  but  will  usually  also  admit 
many  texts  that  would  be  rejected  by  more  detailed  con- 
straints. This  is  a  very  general  method,  in  that  the  sys- 
tem can  be  configured  to  apply  many  different  stages  of 
analysis,  from  "shallower"  processing  to  "deeper"  inter- 
pretation, with  each  stage  applying  stricter  constraints — 
for  example,  word  order,  proximity,  semantic  constraints, 
and  so  forth.  Furthermore,  at  each  stage,  the  effects  of 
filtering  can  be  measured,  generally  showing  a  loss  of  re- 
call and  gain  in  precision.  In  TREC-1  [6],  we  measured 
this  tradeoff  and  found  that  the  highest  11-point  averages 
came  from  the  first  stage  of  filtering;  in  other  words,  the 
gains  in  precision  in  later  stages  were  not  enough  to  make 
up  for  the  loss  of  recall  on  these  measures. 

The  figure  also  illustrates  the  flow  of  information  at 
development  time  and  the  sort  of  knowledge  that  applies 
at  each  stage.  For  example,  at  development  time,  knowl- 
edge can  be  mapped  from  the  deeper  levels  into  shallower 
levels.  At  run  time,  the  subsequent  stages  of  language 
analysis apply this knowledge in stages. For example,
our system can analyze joint venture texts (in English
and Japanese), looking for, among other things, information
about the joint venture company by recognizing
that the company is often the object of the verb establish.
In the Boolean stage, this can be approximated by
looking  for  the  combination  of  words  like  establish  with 
words  like  venture.  In  the  finite  state  pattern  matching 
stage,  the  system  might  look  for  any  word  with  the  root 
establish  followed  by  the  venture  term  (and  perhaps  the 
reverse  order  with  the  verb  in  a  passive  form).  In  deeper 
interpretation,  the  system  applies  syntactic  and  semantic 
constraints to recognize the different ways that the concept
of establishing can be expressed, and ensure that the
words appear in a grammatically acceptable way in the
input. This more detailed analysis can be crucial in data
extraction tasks.

Figure 2: Approximation and Natural Language Processing

A  critical  point  about  this  model  is  that  even  detailed 
knowledge  about  language  often  ends  up  contributing  to 
the Boolean tables, so the Boolean expressions are considerably
more complex than those that could easily be
created  by  hand.  Furthermore,  the  Boolean  expressions 
are  a  relaxed  form  of  more  complex  knowledge-based  con- 
structs. It  is  very  difficult  to  predict  the  impact  of  sim- 
ply relaxing  constraints.  For  example,  TREC  topic  53 
includes the phrase "leveraged buy-out". In a strict syntactic
analysis, this phrase would have to appear exactly
in that form, enforcing both proximity ("leveraged" adjacent
to "buy-out") and order ("leveraged" before "buy-out").
But relaxing such constraints seldom has any severe
effect on results: the Boolean expression (leveraged
AND buy-out) in our system will recognize the two terms
occurring anywhere in the same paragraph, and the additional
texts this admits are more likely to be about leveraged
buy-outs than to be texts in which the words coincidentally
appear together.

Even the phrase "prime rate", which matches some irrelevant
texts in its Boolean form (including, for example,
one or two texts about Japan that mention "prime minister"
in the context of "growth rate"), also admits some
additional relevant texts in which "prime" appears in the
same paragraph as "rate". The well-known effects of word
order and proximity on meaning, exemplified by the distinction
between "blind Venetian" and "Venetian blind",
do not seem to appear very frequently in real examples.
At  least,  these  effects  may  be  less  frequent 

In  TREC-2,  we  did  not  apply  any  constraints  stricter 
than  Booleans  at  run-time;  in  other  words,  we  used  only 
a  Boolean  retrieval  engine  (because  in  TREC-1  we  proved 
that  the  stricter  constraints  didn't  help).  We  also  had  to 
implement  a  module  to  relax  some  of  the  "hard"  Boolean 
constraints  for  topics  with  a  very  small  number  of  relevant 
documents, because of the nature of the performance metrics.
However, we still used the knowledge that was developed
for more detailed processing, including the phrases,
semantic  groupings  and  so  forth.  In  addition,  we  added 
a  more  sophisticated  ranking  mechanism  than  we  used  in 
TREC-1,  because  ranking  is  very  important  in  the  evalua- 
tion. But  the  only  retrieval  engine  in  our  TREC-2  system 
is  a  Boolean  matcher. 

2.1    The  Boolean  Matcher 

The  Boolean  tables  are  efficiently  organized  so  that  a  C 
program  (which  we  now  know  as  NLgrep)  can  match 
them  against  incoming  texts  at  a  rate  of  about  1  million 
words  per  minute.  This  program  spends  little  time  on 
anything  other  than  marking  where  words  in  each  doc- 
ument match  terms  in  the  table.  In  both  routing  and 
retrieval,  the  total  number  of  terms  used  for  a  set  of  50 
topics  is  about  2000,  or  about  40  unique  terms  per  topic. 
All  other  words  are  ignored  entirely. 

For  example,  the  following  is  a  query  for  the  topic 
"South  African  sanctions"  (using  the  enhanced  regular 
expression  language  of  GE's  pattern  matcher  [5]): 




sanction == [(member sanction sanctions
                     disinvestment)
             <Sullivan Principles>
             <punitive *2 measures>] ;

safrica == [(member Buthelezi Pretoria
                    anti-apartheid apartheid)
            <De Klerk>
            <South (member Africa African)> ] ;

;;; rule 1

$sanction * $safrica => (mark-topic 52) ;

This description says that any matching text must have
both an indicator of South Africa ($safrica) and one of
sanctions ($sanction), and that the sanction phrase and
South Africa phrase must appear in the same paragraph
in the document.

A sanction phrase can be any of the simple words sanction,
sanctions, or disinvestment, or any phrase including
punitive measures with no more than two intervening
words (like punitive economic measures). A South Africa
phrase can also be either one of a group of simple words,
or a phrase, like De Klerk, South Africa, or South African.
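
To make the "no more than two intervening words" reading concrete, here is a small sketch of a gapped phrase test in the spirit of the pattern <punitive *2 measures>. It is only an illustration: the function name and the whitespace tokenization are assumptions, and it does not reproduce GE's pattern matcher. Note also that the compiled Boolean tests shown later relax this pattern to a plain AND of the two words.

```python
def phrase_with_gap(tokens, first, last, max_gap=2):
    """True if `first` is followed by `last` with at most `max_gap` words between."""
    words = [t.lower() for t in tokens]
    for i, word in enumerate(words):
        if word == first:
            # The slice covers positions with 0..max_gap intervening words.
            if last in words[i + 1 : i + 2 + max_gap]:
                return True
    return False
```

With the default gap, "punitive economic measures" and "punitive measures" both match, while a text with three or more words between the two terms does not.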

These  queries  or  topic  descriptions  can  be  quite  com- 
plex, and  the  method  has  been  designed  to  handle  many 
queries  simultaneously,  so  the  rule  compiler  is  designed  to 
produce  expressions  that  can  be  efficiently  applied  within 
a  large  set  of  queries.  This  is  important  because  many 
queries  can  share  the  same  simple  terms  or  combinations 
of  terms,  and  because  the  Boolean  matcher  must  match 
the  simplest  expressions  first. 

For  the  topic  description  given  above,  the  output  of  the 
rule  compiler  will  include  the  following  tests: 


52        TERM  AFRICAN
2029      TERM  MEASURES
2134      TERM  SANCTION
2135      TERM  SANCTIONS
2136      TERM  DISINVESTMENT
2138      TERM  SULLIVAN
2139      TERM  PRINCIPLES
2141      TERM  PUNITIVE
2144      TERM  BUTHELEZI
2145      TERM  PRETORIA
2146      TERM  ANTI-APARTHEID
2147      TERM  APARTHEID
2149      TERM  DE
2150      TERM  KLERK
2152      TERM  SOUTH
2153      TERM  AFRICA
2137      OR    2134 2135 2136
2140      AND   2138 2139
2142      AND   2141 2029
2143      OR    2137 2140 2142
2148      OR    2144 2145 2146
2151      AND   2149 2150
2154      OR    2153 52
2155      AND   2152 2154
2156      OR    2148 2151 2155
2157      AND   2143 2156
2158      AND   2156 2143
TOPIC052  OR    2157 2158

Each  line  in  the  above  data  gives  a  unique  number 
(or  topic  designator)  to  the  test,  a  test  identifier  (either 
TERM  for  a  simple  word  test,  OR,  or  AND),  and  a  list 
of  simple  terms  or  previous  tests.  For  example,  test  2137 
depends  on  tests  2134,  2135,  and  2136,  and  is  true  if  any 
of  those  tests  is  true,  namely,  if  the  text  includes  any 
of  the  words  sanction,  sanctions,  or  disinvestment.  The 
tests  are  automatically  ordered  so  that  all  tests  that  are 
dependent  on  other  tests  will  have  higher  numbers  than 
the  tests  they  depend  on;  thus  all  TERM  tests  appear 
first.  In  this  case,  the  TERM  test  AFRICAN  appears 
with  a  much  lower  number  simply  because  it  is  used  in 
many different queries.

The  matcher,  which  can  work  either  on  complete  doc- 
uments or  paragraphs  (but  we  used  paragraph  matching 
only  in  TREC-2)  goes  through  every  word  in  its  input 
and,  using  a  fast  table  look-up,  sets  the  TERM  tests  to 
true  for  every  word  it  encounters.  At  the  end  of  input,  ei- 
ther the  end  of  the  paragraph  or  end  of  each  document,  it 
runs  through  the  table  of  possible  tests  from  low  numbers 
to  high  numbers  and  sets  tests  to  true  if  their  conditions 
are  satisfied.  A  topic  test  produces  a  match  if  it  has  be- 
come true  at  the  end  of  this  process,  meaning  that  the 
paragraph  or  document  has  passed  the  pre-filter  for  that 
query.  A  single  paragraph,  of  course,  can  satisfy  multiple 
queries. 
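
The two-pass evaluation just described can be sketched in a few lines. This is an illustrative reconstruction under stated assumptions, not the C program itself: the test list is an abbreviated subset of the compiled tests for topic 52, the topic test is wired directly to tests 2137 and 2155 to keep the example short (the real table goes through several intermediate tests), and tokenization is a simple whitespace split with punctuation stripped.

```python
# Abbreviated, partly hypothetical subset of the compiled tests for topic 52.
TESTS = [
    (52, "TERM", ["AFRICAN"]),
    (2134, "TERM", ["SANCTION"]),
    (2135, "TERM", ["SANCTIONS"]),
    (2136, "TERM", ["DISINVESTMENT"]),
    (2152, "TERM", ["SOUTH"]),
    (2153, "TERM", ["AFRICA"]),
    (2137, "OR", [2134, 2135, 2136]),
    (2154, "OR", [2153, 52]),
    (2155, "AND", [2152, 2154]),
    ("TOPIC052", "AND", [2137, 2155]),  # simplified wiring (assumption)
]

PUNCTUATION = ".,;:!?()\"'"


def match_paragraph(text, tests):
    """Return the set of topic designators whose tests are true for `text`."""
    term_index = {args[0]: test_id
                  for test_id, kind, args in tests if kind == "TERM"}
    truth = {test_id: False for test_id, _, _ in tests}

    # Pass 1: one table look-up per word; all other words are ignored.
    for raw in text.split():
        test_id = term_index.get(raw.strip(PUNCTUATION).upper())
        if test_id is not None:
            truth[test_id] = True

    # Pass 2: evaluate composite tests in table order, so that every
    # dependency has already been decided when a test is reached.
    for test_id, kind, args in tests:
        if kind == "OR":
            truth[test_id] = any(truth[a] for a in args)
        elif kind == "AND":
            truth[test_id] = all(truth[a] for a in args)

    # Topic tests are the ones with string designators in this sketch.
    return {t for t, ok in truth.items() if ok and isinstance(t, str)}
```

Because composite tests refer only to earlier entries, a single forward sweep over the table is enough, which is why the ordering produced by the rule compiler matters.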

This  portion  of  the  system  was  implemented  in  the 
space  of  a  few  days,  and  is  almost  entirely  the  same  as  in 
TREC-1.  Our  focus  since  last  year  has  been  on  query  con- 
struction and  ranking  rather  than  matching  or  retrieval. 

2.2    Query  construction 

Our  approach  assumes,  in  general,  that  manual  query 
construction  is  acceptable  for  routing.  In  ad  hoc  retrieval, 
query  time  can  be  of  the  essence,  but  in  many  routing 
applications,  queries  are  developed  and  refined  over  time. 
The  amount  of  time  spent  on  query  construction  using 
a  manual  method  in  our  system  is  comparable  to  the 
amount  of  time  spent  on  the  topic  descriptions  used  for 
automatic  query  generation. 
2.2.1  "Manual" queries for routing

In  manual  routing,  our  approach  uses  a  statistical  cor- 
pus analysis,  developed  originally  for  text  categorization 
[4],  to  pull  out  terms  based  on  their  relative  frequency 
in  relevant  documents  for  each  topic.  The  statistic  used 
combines  the  entropy-based  mutual  information  statistic 
(testing  the  independence  of  each  term  with  each  topic) 
with  a  correction  for  low-frequency  terms  and  for  ambigu- 
ous words.  Words  with  high  weights  have  a  high  degree 
of  association  with  a  topic.  This  statistical  analysis  is 
also  used  in  ranking.  The  base  weighting  formula  is  the 
following: 

$C \, (\log_2 b)(\log_2 r)$

where C is a constant, b is the number of times a term
appears in a story assigned to a particular category, for
example, and $\log_2 r$ is the log of the ratio of the joint
probability (i.e., of a particular word or phrase occurring
in a text about a particular category) to the product
of the independent probabilities, that is, the mutual information
statistic. This tests the assumption that the use of the
word and the category of the text are independent. When
this assumption is false, the word gets a high positive or
negative weight.
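
As a worked illustration of the base formula, the sketch below estimates the probabilities from simple counts. The count-based estimates, the parameter names, and the choice to return zero when the term never occurs in the category are assumptions made for the example; the corrections for ambiguous words mentioned above are not reproduced.

```python
import math


def weighted_mutual_information(b, n_word, n_category, n_total, C=1.0):
    """Base weight C * log2(b) * log2(r) for one (term, category) pair.

    b          : occurrences of the term in texts of the category
    n_word     : occurrences of the term overall
    n_category : size of the category, in the same units
    n_total    : size of the whole sample, in the same units
    """
    if b <= 0:
        return 0.0  # assumption: no joint evidence, no weight
    # r = P(word, category) / (P(word) * P(category)), estimated from counts.
    r = (b / n_total) / ((n_word / n_total) * (n_category / n_total))
    return C * math.log2(b) * math.log2(r)
```

For instance, with b = 8, n_word = 16, n_category = 32 and n_total = 1024, the ratio r is 16, so the weight is C * 3 * 4 = 12C. A term seen only once in the category gets weight zero because log2(1) = 0, which is how the log2(b) factor damps low-frequency terms.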

For  example,  the  following  are  the  top  words  for  Topic 
51,  "Airbus  subsidies": 


A-330                        TOPIC51  1263.3
AIRBUS                       TOPIC51  1183.1
A-340                        TOPIC51  1178.2
INDUSTRIE                    TOPIC51  1071.9
A-320                        TOPIC51  1067.5
MESSERSCHMITT-BOELKOW-BLOHM  TOPIC51
AERONAUTICAS                 TOPIC51  843.3
CONSTRUCCIONES               TOPIC51  807.9
AEROSPATIALE                 TOPIC51  762.5
MBB                          TOPIC51  722.6
WIDE-BODY                    TOPIC51  617.7
MD-11                        TOPIC51  613.8
TOULOUSE                     TOPIC51  228.5
JETLINERS                    TOPIC51  217.8
LUFTHANSA                    TOPIC51  196.6
MD-80                        TOPIC51  196.6

Clearly,  these  words  all  have  some  reason  to  be  associ- 
ated with  this  topic,  but  adding  them  to  the  appropriate 
group  in  each  query  (or  ignoring  them  entirely)  is  a  "man- 
ual" process.  Our  manual  routing  queries,  therefore,  are 
a  combination  of  the  regular  expressions  that  were  devel- 
oped from  the  topic  descriptions  with  terms  added  that 
were  selected  from  the  automatic  training.  This  is,  we 
believe,  a  very  practical  manual  approach  that  has  very 
good  performance. 


2.2.2  "Hard"  vs.  "Soft"  Booleans 

The  Boolean  matcher  uses  a  "hard"  Boolean  approach,  in 
that  it  will  admit  only  texts,  for  each  query,  that  satisfy 
the  conditions  of  that  query.  For  example,  in  Topic  51 
above, "Airbus subsidies", the matcher will admit only
texts that have both an Airbus term and a subsidy term
in  the  same  paragraph.  However,  this  is  a  narrow  topic, 
and  TREC-2  allows  each  system  to  produce  1000  texts 
for  each  topic.  The  evaluation  metrics  offer  no  penalty 
for  filling  up  the  list  of  1000  with  texts  that  are  likely  to 
be irrelevant. So, in order to provide increased flexibility
and  consider  larger  numbers  of  texts  for  each  topic,  we 
used  an  additional  engine  only  for  the  purpose  of  pulling 
in  texts  for  very  specific  queries  like  this  one. 

The  system  is  still  a  hard  Boolean  system  in  that  texts 
that  satisfy  the  Boolean  conditions  will  always  be  ranked 
higher  than  texts  that  do  not  satisfy  the  conditions;  how- 
ever, texts  that  do  not  satisfy  the  conditions  can  appear 
on  the  final  ranked  lists.  The  "soft"  engine  considers  such 
texts  by  relaxing  some  of  the  Boolean  conditions,  effec- 
tively pulling  in  texts  that  have  a  large  number  of  terms 
that  match  the  query,  but  do  not  necessarily  meet  all  the 
conditions.  This  component  of  the  system  is  more  like 
statistical  retrieval  engines;  however,  it  does  not  have  a 
large  impact  on  the  overall  scores,  because  it  only  af- 
fects the  results  at  the  low-precision  extreme  (the  low- 
est rankings)  for  queries  that  match  very  few  documents. 
In  fact,  for  Topic  51,  the  hard  Boolean  query  matches 
11  texts,  and  the  soft  method  pulls  in  an  additional  989 
texts.  But  there  are  only  11  texts  that  are  judged  rele- 
vant, of  which  10  satisfy  the  hard  Boolean.  So  enforcing 
the  hard  Boolean  condition  seems  to  work  well  for  this 
topic,  and  the  soft  Boolean  doesn't  contribute  much. 
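
The interaction between the hard and soft conditions amounts to a two-level sort, sketched below. The tuple layout and the notion of a numeric score (for example, a sum of weights of matched terms) are assumptions for illustration; the text only states that texts satisfying the hard Boolean always outrank those that do not, and that up to 1000 texts are returned per topic.

```python
def rank_hard_then_soft(texts, limit=1000):
    """Rank texts so that hard-Boolean matches always come first.

    texts: iterable of (text_id, satisfies_hard, score) tuples.
    Returns at most `limit` text ids, best first.
    """
    ordered = sorted(texts, key=lambda t: (not t[1], -t[2], t[0]))
    return [text_id for text_id, _, _ in ordered[:limit]]
```

Under this scheme the soft matches only fill the positions left over after all hard matches, which is consistent with the observation that they affect results only at the low end of the ranked list.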

For  some  topics,  the  balance  between  the  hard  Boolean 
condition  and  the  soft  Boolean  isn't  so  clear;  i.e.  not  en- 
forcing the  hard  condition  would  lead  to  better  ranking. 
This seems to be a function of how well the topics fit with
Boolean  expressions  in  general.  "Airbus  subsidies"  is  re- 
ally a  Boolean  topic,  in  that  a  relevant  text  must  say 
something  about  Airbus  and  something  about  subsidies. 
Other  topics  like  "automation"  are  much  harder  to  ex- 
press in  a  Boolean  form.  We  will  cover  these  issues  in 
topic-by-topic  performance  later  in  this  paper. 

2.2.3  "Automatic"  queries  for  ad  hoc  retrieval 

The  "manual"  query  method  is  partly  automated  in  the 
sense  that  the  corpus-based  statistical  training  suggests 
many  of  the  terms  that  are  used  in  the  queries.  But  it  is 
"manual"  in  that  the  initial  formulation  of  the  queries 
is  done  manually  from  the  TREC  topic  descriptions. 
We  have  tried,  as  a  simple  experiment,  to  generate  the 
Boolean  term  groupings  and  expand  each  term  automat- 
ically from  the  topic  descriptions.  In  the  days  before  the 
TREC-2 ad hoc test, we tried several different ways of doing
this automatic Boolean query generation, and chose
the  one  that  worked  best  on  the  sample  data. 

Our  first  attempt  was  to  use  the  common  methods  for 
finding  collocations  and  word  associations  in  sentences, 
but these worked very poorly for term expansion. The problem
is that this approach finds more associations like "funeral"
and "home" than associations like "hostage" and "captive",
and the latter, text-level associations are what is required
to generate good queries.

The  "solution"  we  tried  was,  for  a  sample  of  about  10 
million  words  in  the  corpus,  to  choose  the  top  20  words 
based  on  TF.IDF  weights  for  each  document,  store  the 
frequency  of  association  among  these  terms,  and  then 
weight  each  pair  using  the  weighted  mutual  information 
statistic  of  the  previous  section.  This  was  much  better 
than  using  sentence-level  information,  although  it  is  still 
a  very  straightforward  approach.  For  example,  the  fol- 
lowing are  the  top  10  terms  associated  with  the  word 
"hostage"  (in  order): 

hostages 

Lebanon 

Beirut 

Iran

release 

Terry 

kidnappers 

kidnapped 

Jihad 

Anderson 

While  this  is  certainly  not  the  optimal  set  of  terms  to 
use  in  place  of  "hostage",  it  is  a  good  start. 
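
The document-level association procedure described above can be sketched as follows. This is a simplified reading, not the system's code: the TF.IDF variant (raw term frequency times log inverse document frequency), the use of document counts to estimate probabilities, and the function names are all assumptions.

```python
import math
from collections import Counter
from itertools import combinations


def top_terms(doc_tokens, doc_freq, n_docs, k=20):
    """The k highest-scoring words of one document under a simple TF.IDF."""
    tf = Counter(doc_tokens)
    scores = {w: tf[w] * math.log(n_docs / doc_freq[w]) for w in tf}
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


def associations(docs, k=20, C=1.0):
    """Weight each pair of key terms by C * log2(joint) * log2(r)."""
    n_docs = len(docs)
    doc_freq = Counter(w for d in docs for w in set(d))
    key_count = Counter()   # documents in which a word is a key term
    pair_count = Counter()  # documents in which both words are key terms
    for d in docs:
        keys = sorted(top_terms(d, doc_freq, n_docs, k))
        key_count.update(keys)
        pair_count.update(combinations(keys, 2))
    weights = {}
    for (a, b), joint in pair_count.items():
        r = joint * n_docs / (key_count[a] * key_count[b])
        weights[(a, b)] = C * math.log2(joint) * math.log2(r)
    return weights
```

Sorting the pairs for a given word by weight would yield an ordered list of associated terms like the one shown for "hostage". Pairs that co-occur only once receive weight zero, which illustrates why a small training sample leaves low-frequency terms poorly covered.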

The  next  problem  in  automatic  query  construction  is 
when  to  use  a  combination  of  terms  and  when  to  use 
a  single  term.  For  example,  the  term  "weather-related 
fatalities"  is  a  combination  of  two  word  groups  (weather 
and  fatalities)  while  "Iran-contra  affair"  is  really  only  one 
group (Iran-contra), even though it might appear that
"affair" is a significant term.

Again  we  took  the  direct  approach,  choosing  to  com- 
bine terms  whenever  there  was  a  reasonable  percentage  of 
overlap  between  their  associated  terms.  This  worked  sur- 
prisingly well  in  cases  where  the  topic  title  was  a  good  de- 
scription (e.g.  "welfare  reform")  and  very  badly  for  those 
with  vague  titles  (e.g.  "find  innovative  companies").  We 
tried  to  recover  from  these  by  including  more  words  from 
the  description  and  narrative,  but  then  we  had  to  start 
recognizing  the  language  of  these  descriptions,  filtering 
out  words  like  "relevant",  "mention"  and  so  forth.  At  this 
crude  stage,  the  main  problem  with  the  query  generation 
method  is  using  the  structure  of  the  topic  descriptions. 
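
One way to read the overlap heuristic is sketched below: terms whose associated-term sets overlap strongly are merged into a single group, and the remaining groups stay separate (to be combined with AND). The overlap measure, the threshold of 0.3, and the example association sets are hypothetical; the text says only that terms were combined when there was "a reasonable percentage of overlap".

```python
def overlap(a, b):
    """Shared fraction of two associated-term sets, relative to the smaller set."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def group_terms(terms, associated, threshold=0.3):
    """Greedily merge title terms whose associated-term sets overlap enough."""
    groups = []
    for term in terms:
        for group in groups:
            if any(overlap(associated[term], associated[m]) >= threshold
                   for m in group):
                group.append(term)
                break
        else:
            groups.append([term])
    return groups
```

With made-up association sets in which "iran-contra" and "affair" share most of their associated terms, the two collapse into one group, while "weather" and "fatalities", which share none, remain two groups.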

The  second  major  issue  with  automatic  query  genera- 
tion is  that  it  isn't  nearly  as  good  at  finding  good  terms 
as  the  process  of  training  from  data  and  relevance  judge- 
ments, as  used  in  the  routing  experiments.  The  relevance 


judgements  used  for  routing  contain  large  volumes  of  rela- 
tively high-accuracy  data,  while  the  training  used  for  term 
expansion  in  query  generation  relied  on  relatively  small 
volumes  of  relatively  noisy  data.  For  example,  the  word 
"welfare"  used  in  one  of  the  ad  hoc  topics  occurred  with  a 
high  enough  TF.IDF  weight  only  29  times  in  the  training 
sample, and the most frequently associated term, "children",
occurred only 6 times. In order to establish good
associations  between  "welfare"  and  less  frequent  terms, 
we  would  need  much  more  data.  The  data  from  TREC-2 
seem  to  suggest  that  low-frequency  terms  contribute  more 
in  term  expansion  than  high-frequency  terms,  so  using  a 
"small"  training  sample  (10  million  words  is  only  about 
3%  of  the  corpus)  was  a  major  error.  We  made  many 
other  mistakes  in  the  training  method,  including  mixing 
samples  from  the  Federal  Register  and  DOE  sources  with 
other  texts  that  are  much  more  likely  to  be  relevant.  This 
leaves  a  lot  of  room  for  future  experiments  and  improve- 
ment. 

The fully automatic ad hoc system certainly didn't do
as well as the manual routing system, but it was still at
or above median for more than half of the ad hoc topics.
Considering that this method could be used within the
context of almost any legacy retrieval system, the result
is worth noting. Furthermore, the generation of Boolean
queries from natural language descriptions is an
interesting, as well as practical, research problem, because many
different retrieval systems can make some use of Boolean
queries.

3  Ranking 

In both routing and ad hoc, we used a set of word weights
for ranking, acquired using the relevance judgements in
the routing case and from the corpus data in the ad hoc
case. In routing, the weights reflect the statistical
measure of association between the term and each topic (using
the weighted mutual information score given earlier). In
the ad hoc case, the weight is a function of the frequency
of the term in the topic description, the inverse
collection frequency, and an additional factor to weight certain
components of the topic descriptions (such as the title and
description) more heavily than others. We combined the
weighted frequency of these terms with an overall count
of the number of topic hits per document, normalizing for
document length, to produce a score for each document.
This combination was the result of trying many different
approaches on the test data, so it was clearly a good fit for
our system.
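
A rough sketch of an ad hoc scoring function of this general shape is given below. The text does not give the actual formula, so everything specific here is an assumption: the field factors, the logarithmic inverse collection frequency, adding the hit count to the weighted sum, and dividing by document length.

```python
import math

# Assumed emphasis factors for parts of a topic description (hypothetical values).
FIELD_FACTOR = {"title": 2.0, "description": 1.5, "narrative": 1.0}


def term_weights(topic_fields, doc_freq, n_docs):
    """Weight = (occurrences in topic, scaled by field) x inverse collection frequency."""
    weights = {}
    for field, words in topic_fields.items():
        for w in words:
            icf = math.log(n_docs / (1 + doc_freq.get(w, 0)))
            # Summing per occurrence makes the weight grow with topic frequency.
            weights[w] = weights.get(w, 0.0) + FIELD_FACTOR[field] * icf
    return weights


def score(doc_words, weights):
    """Weighted term frequency plus distinct topic hits, normalized by length."""
    if not doc_words:
        return 0.0
    weighted = sum(weights.get(w, 0.0) for w in doc_words)
    hits = len(set(doc_words) & set(weights))
    return (weighted + hits) / len(doc_words)


topic = {"title": ["welfare", "reform"], "description": ["welfare"]}
weights = term_weights(topic, {"welfare": 9, "reform": 99}, n_docs=1000)
print(score(["welfare", "reform", "bill", "vote"], weights))
```

A document that contains both topic terms outscores one of the same length that contains only the more common term, which is the behaviour the paragraph describes.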

However, in comparing our results with those of other
systems, our precision curve across various recall points
is not nearly as good as that of a system that does really
good ranking. In routing, we are not sure that ranking is
important, but it is certainly important in getting good
results in TREC. So, we are inclined to try to combine our
retrieval method with alternative ranking methods to see,
for example, whether more terms are really necessary in
order to get better ranking results.

The  separation  of  retrieval  and  ranking  seems  to  be  a 
valuable  tool  both  for  experimental  research  and  for  iden- 
tifying different  techniques  for  applications.  It  is  clearly  a 
problem  with  both  TREC-1  and  TREC-2  that  the  routing 
task  requires  a  comparison  of  documents  across  a  large 
collection,  when  most  routing  applications  deal  with  a 
stream  of  documents  individually  or  in  small  groups. 

4    Analysis  of  Results 

The  results  raise  a  number  of  important  issues,  espe- 
cially: why  Boolean  approximation  works  as  well  as  it 
does,  particularly  why  it  works  for  routing;  where  statis- 
tical weighting  could  help  more;  what  sort  of  topics  this 
approach  does  well  on  (and  which  topics  it  does  badly 
on);  and  other  obvious  areas  for  improvement. 

One  of  the  most  important  sources  of  information 
about  the  advantages  and  disadvantages  of  each  approach 
comes  from  comparing  the  performance  of  different  sys- 
tems on  different  topics.  Unfortunately,  this  is  also  a 
very  difficult  task,  because,  while  it  is  easy  to  tell  which 
systems  did  well  on  which  topic,  it  is  often  hard  to  gener- 
alize from  that  evidence  why  the  approach  worked  or  why 
it  didn't. 

As  we  have  mentioned,  the  Boolean  approach  is  very 
erratic  with  respect  to  performance  by  topic,  as  compared 
with  other  systems,  particularly  the  statistical  methods 
that  emphasize  weighting.  For  example,  our  manual  rout- 
ing system,  which  was  clearly  one  of  the  best  systems,  had 
the  top  performance  (in  precision  at  100  documents)  on  8 
topics,  but  was  below  median  on  17  topics  (out  of  50).  In 
the  11-point  averages,  that  system  was  below  median  on 
22  topics — more  than  40%  of  the  time — although  it  out- 
performed most  of  the  systems  on  average.  By  contrast, 
one  of  the  Cornell  systems  [1]  was  above  median  on  ev- 
ery topic!  This  suggests  that  our  approach  degrades  less 
gracefully  than  other  approaches,  and  that  it  is  important 
to  explore  Boolean  methods  as  an  adjunct  to  other  meth- 
ods that  work  in  the  cases  where  the  Boolean  approach 
seems  to  fail.  Conversely,  our  system  had  top  or  near-top 
scores  on  a  significant  number  of  topics;  it  is  important 
to  know  how  to  take  advantage  of  this  within  the  context 
of  weighting  systems. 

There seem to be several different explanations for
variation on the topics. First, there are topics, as we have
discussed, that are particularly well suited to Boolean
methods (and others that are not well suited at all). Second,
there are cases where the training method seems to work
particularly well. Third, there are cases where the
manual approach might work well because there are terms in
the topic description that are particularly misleading.
Finally, there are many reasons why our approach can fail,
particularly on topics with very small numbers of relevant
documents and in cases where the topics are very vaguely
specified.

One of the topics where GE had the best results was
Topic 53, "leveraged buy-outs". The topic description
specified that relevant documents had to describe an LBO
above $100 million in value, and give the terms of the
buy-out. Apparently, the $100 million figure is not
important, because most of the LBOs that are reported are
major buy-outs. However, the terms (the specification
of the dollar amount) are required. Many articles about
LBOs do not report dollar amounts. This is similar to
the "Airbus subsidies" topic, where many articles that
talk about Airbus do not mention subsidies, and they are
not relevant. The advantage here seems to be that the
hard Boolean outperforms the weighting approaches
because weighting, without Booleans, is likely to give an
article with many LBO words, but no dollar figures, a
high weight, just as it could give a high weight to an article
about Airbus that doesn't mention subsidies.
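
The contrast can be illustrated with a toy example. This is only a sketch: the term list, the dollar-amount pattern, and the two sample sentences are invented for illustration and are not the GE queries.

```python
import re

# Hypothetical topic vocabulary and dollar-figure pattern.
LBO_TERMS = {"lbo", "buy-out", "leveraged", "takeover"}
DOLLAR = re.compile(r"\$\s?\d[\d,.]*\s*(million|billion)")


def weighted_score(text):
    """Pure weighting: count topic words, with no required component."""
    words = re.findall(r"[a-z-]+", text.lower())
    return sum(1 for w in words if w in LBO_TERMS)


def boolean_match(text):
    """Hard Boolean: some LBO term AND a dollar figure must both be present."""
    return weighted_score(text) > 0 and DOLLAR.search(text.lower()) is not None


general = "Leveraged buy-out activity grew as LBO firms sought takeover targets."
specific = "The LBO of the grocery chain was valued at $4.2 billion."

for text in (general, specific):
    print(weighted_score(text), boolean_match(text))
```

The general article collects more topic words and so would rank higher under weighting alone, but only the article that states a dollar amount satisfies the Boolean query.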

Training seems to help in the "leveraged buy-out" case
as well. The training picked up many
names of companies involved in buy-outs, like "Safeway"
and "Dart", and these were included in the queries. This
perhaps helped to separate articles about specific
buy-outs from buy-outs in general. A similar effect came
about on Topic 92, "international military equipment
sales", where the training pulled in names of many of
the weapons typically sold on the international market.

Topic 86, "bank failures", was another topic where the
GE system outperformed all others on both the 11-point
average and precision at 100 documents. This result is
hard to explain, but the one conspicuous fact about the
topic is that our query does not include the word "bank".
It does include the names of many prominent banks, so it
may be, like the LBO case, that good performance on this
topic depends mainly on distinguishing specific references
to failures from general discussions about bank failures,
for example, the S&L crisis.

On  the  topics  that  have  very  few  relevant  documents, 
our  approach  often  failed  because,  in  the  absence  of  train- 
ing data,  it  tended  to  undergenerate;  thus  very  few  texts 
(sometimes  none)  would  match  the  Boolean  query  and 
other  systems  with  good  weighting  would  pull  in  more 
relevant  documents.  In  these  cases,  a  system  that  finds 
one  relevant  document  scores  much  better  than  a  system 
that  finds  zero,  so  the  penalty  for  undergeneration  is  very 
high. 

The second class of topics where we seem to go wrong
is those that are vaguely specified. For example, Topic
74, "policy conflict", is a very hard topic, where the
description does not include very much information. Texts
rarely mention policy conflict, and, when they do, they
are rarely relevant. On the other hand, texts about
tobacco policies and health are likely to be relevant. This
is probably a case where there is no point in having a
Boolean query, and probably systems with the best
training and weighting methods do best.

Another  important  point  about  the  TREC-2  results  is 
that  the  best  automatic  systems  did  significantly  better 
than  the  best  manual  systems  in  the  routing  task,  prob- 
ably because  of  the  success  of  the  training  methods  used. 
This  raises  an  important  question  with  respect  to  the  rel- 
ative contributions  of  time  spent  on  query  construction 
versus  time  spent  assembling  training  data.  The  volume 
of  training  data  used  in  TREC-2,  with  hundreds  of  thou- 
sands of  relevance  judgements,  is  not  realistic  for  most 
routing  scenarios.  Thus,  while  we  should  continue  to  rely 
on  good  training  methods,  we  should  be  careful  to  sep- 
arate out  the  effects  of  training  and  to  develop  manual 
routing  methods  that  work  with  smaller  amounts  of  train- 
ing data. 

Given  that  Boolean  and  manual  methods  seem  to  do 
best  on  certain  topics,  and  other  approaches  that  em- 
phasize weighting  do  better  on  others,  it  makes  sense  to 
combine  the  best  from  different  approaches.  However, 
this  raises  the  issue  of  how  different  results  can  be  incor- 
porated into  our  model  without  losing  the  advantages  of 
the  Boolean  method,  particularly  the  compatibility  with 
so  many  existing  systems. 

5    Future  Goals 

Many  of  our  results  from  TREC-2  suggest  areas  where  the 
method  of  generating  Boolean  queries,  especially  using 
corpus  data,  can  be  substantially  improved.  There  is  a 
great  deal  of  room  for  future  progress,  so  we  believe  that 
this  approach  will  continue  to  be  viable  with  respect  to 
other  routing  and  retrieval  methods. 

The  main  area  for  research  is  in  continuing  to  ex- 
plore new  corpus  analysis  methods.  Our  corpus  analyzer 
weights  single  terms.  But  the  Boolean  queries  depend  on 
combinations  of  terms,  not  only  in  the  case  of  phrases  but 
also  to  control  the  effect  of  ambiguous  words.  In  the  con- 
text of  a  given  query,  a  single  term  can  often  be  roughly 
comparable  to  the  Boolean  AND  of  two  or  more  other 
terms.  Up  to  this  point,  we  have  not  quite  been  able  to 
automate  the  process  of  discovering  these  relationships  in 
the  corpus.  This  is  important  for  both  routing  and  ad 
hoc  retrieval,  but  especially  for  routing. 

Both  the  routing  and  ad  hoc  systems  can  benefit  from 
the  use  of  new  ranking  methods,  and  possibly  from  ex- 
ploring hybrid  approaches  that  take  advantage  of  the 
Boolean  method  on  topics  that  are  well  suited  to  Boolean 
expression  and  degrade  more  gracefully  to  traditional 
weighting  methods  on  other  topics.  In  general,  the  com- 
bination of  methods  is  something  that  merits  new  exper- 
iments. 

The routing system could benefit from new training
methods. Because the Cornell system did especially well
using a training method that produced large numbers of
terms and used both positive and negative information,
it is possible that this general approach could help our
system as well.

The  ad  hoc  system  suffers  mostly  from  difficulties  in 
handling  the  topic  descriptions;  our  method  of  deriving 
Boolean  expressions  from  the  topic  descriptions  is  still  ex- 
tremely crude,  and  there  are  many  topics  for  which  the 
approach  produced  almost  useless  queries.  The  perfor- 
mance of  the  automatic  ad  hoc  system  was  even  more 
erratic  than  that  of  the  manual  routing  system,  but  it  is 
very  likely  that  many  of  the  problems  can  be  solved  with 
a  lot  more  work  on  processing  the  topic  descriptions.  Al- 
though we  are  loath  to  direct  research  at  issues  that  are 
particular  to  the  formulation  of  the  TREC  topics,  this 
work  may  be  necessary  to  determine  the  real  power  of 
automatically  generating  Boolean  queries. 

Finally,  we  are  interested  in  exploring  many  ways  that 
our  corpus  analysis  and  query  generation  component  can 
be  combined  with  other  systems.  Because  we  have  fo- 
cused our  attention  on  query  content  rather  than  ranking 
or  retrieval  models,  we  believe  that  our  results  could  quite 
likely  be  used  within  many  other  retrieval  systems.  It  is 
natural  to  look  for  such  synergy.  We  have  had  some  pre- 
liminary collaboration  with  the  UMass  team  to  try  to  use 
our  queries  within  the  INQUERY  system  [2],  but  we  still 
have  a  long  way  to  go.  We  will  continue  to  explore  such 
collaborative  efforts  and  to  concentrate  our  own  efforts 
on  corpus  analysis  and  building  queries. 

6  Summary 

GE's participation in TREC involved the implementation
of a number of strategies for creating Boolean queries
from the topic descriptions. A statistical corpus analyzer
helped to refine queries for the routing task and
to generate them automatically for the ad hoc task. The
simple Boolean retrieval engine performed well, especially
in routing. As before, there is tremendous variation in
the topic-by-topic results, suggesting that a great deal
more research is needed to find how to get the best
results in different routing and retrieval scenarios. We are
encouraged by the progress of our system, as well as of
the overall field, in these experiments, and are hopeful
that in the coming years we will learn how to combine
our promising results in Boolean approximation and
corpus analysis with the more mature ranking and retrieval
models of some of the other systems.
1.0  Introduction 

For  TREC-II,  we  were  interested  in  experimenting  with  improved  methods  of  constructing 
queries  for  the  Fast  Data  Finder  (FDF)  text  search  coprocessor.  We  learned  from  TREC-I 
that  while  the  pattern  matching  ability  of  the  FDF  can  sometimes  be  put  to  significant 
advantage  (we  had  the  high  score  on  8  of  the  50  routing  topics  in  TREC-I),  this  wasn't 
sufficient  overall  to  overcome  the  weaknesses  traditionally  associated  with  the  boolean 
approach  to  text  retrieval.  Many  of  the  TREC  topics  are  too  abstract  and  ambiguous  to 
respond  well  to  a  boolean  query  formulation. 

Our goal for this year, therefore, was to apply the FDF hardware to a more statistical or soft
boolean retrieval approach while not giving up on our ability to make use of specific
features or patterns in the text when they are obviously important.

We  experimented  with  two  different  schemes.  In  the  first  scheme,  we  utilized  subquery 
proximity  to  rank  hit  documents.  We  developed  the  subqueries  manually,  then  determined 
the  optimum  proximity  values  by  test  runs  on  the  training  data.  The  most  effective  values 
were  then  used  in  the  official  routing  queries.  The  second  scheme  was  an  FDF  adaptation 
of  the  traditional  Information  Retrieval  (IR)  term  weighting  approach.  In  addition  to  single 
word  terms,  we  also  included  two  and  three  word  phrases,  and  FDF  subqueries  designed  to 
detect  special  features  in  the  text. 

While  in  the  terminology  of  TREC  both  are  examples  of  manual  query  formulation  with 
feedback,  we  believe  these  techniques  can  be  evolved  to  create  queries  automatically  from 
samples  of  relevant  text  and  to  also  incorporate  user  knowledge  of  specific  text  features  of 
interest when it exists. We also continue to believe that the utilization of a hardware
accelerator such as the Fast Data Finder enables the implementation of high performance
routing or dissemination applications at a far lower cost than can be achieved with
conventional general purpose processors.
2.0     The  FDF  Text  Retrieval  Approach 


The  Fast  Data  Finder  is  a  hardware  device  that  performs  high-speed  pattern  matching  on  a 
stream  of  8-bit  data.  It  consists  of  an  array  of  identical  programmable  text  processing  cells 
connected  in  series  to  form  a  pipeline  processor.  The  cells  are  implemented  using  a  custom 
VLSI  chip  designed  and  patented  by  TRW.  In  the  latest  implementation,  each  chip  contains 
24 processor cells and a typical system will have 3,600 cells. Each cell can match a single
character of a query or perform all or part of a logical operation. The processors are
interconnected with an 8-bit data path and an approximately 20-bit control path. To perform
a  search,  a  microcode  program  is  first  downloaded  into  the  pipeline  to  direct  each 
processor.  The  database  is  then  streamed  through  the  pipeline.  The  data  bytes  clock 
through  each  processor  in  turn  until  the  whole  database  has  passed  through  all  processors. 
As  the  data  is  clocking  through,  the  processors  alter  the  state  of  the  control  lines  depending 
on  their  program  and  the  data  stream  values. 

When  the  pipeline's  processor  cells  detect  that  a  series  of  database  characters  match  the 
desired  pattern,  a  hit  is  indicated  and  passed  by  external  circuitry  back  to  the  memory  of  the 
host  processor  and  to  the  user.  The  FDF  pipeline  runs  at  a  constant  speed  as  it  performs 
character  comparisons  and  logical  operations,  regardless  of  query  complexity. 
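
As a loose software analogy (not a model of the actual FDF microcode or control lines), a chain of single-character cells can be simulated as below. Each cell compares the current character with its own query character and passes a match signal downstream. The function name and the one-flag-per-cell state are assumptions made for illustration.

```python
def pipeline_search(pattern, stream):
    """Yield the end offset of every occurrence of `pattern` in `stream`.

    One simulated cell per query character; each cell does a fixed amount of
    work per input character. In hardware all cells would update in parallel,
    which is why throughput would not depend on query size.
    """
    cells = list(pattern)
    if not cells:
        return
    flags = [False] * len(cells)  # match signal leaving each cell
    for pos, ch in enumerate(stream):
        # Walk from the last cell backwards so that each cell sees the signal
        # its upstream neighbour produced on the previous character.
        for i in range(len(cells) - 1, -1, -1):
            upstream = flags[i - 1] if i > 0 else True
            flags[i] = upstream and ch == cells[i]
        if flags[-1]:
            yield pos


print(list(pipeline_search("subsid", "airbus subsidies and subsidy talks")))
```

Overlapping occurrences are handled naturally because every cell carries its own partial-match state as the stream moves through.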

The queries or patterns are specified in the FDF's Pattern Specification Language (PSL).
The hardware directly supports all the features in the PSL query language without the need
for software post-processing. The processors in the pipeline may all be used to evaluate a
single large query or may be assigned to evaluate numerous smaller queries. The number
of pipeline cells a query needs is proportional to the size of the query. PSL provides
numerous search functions, which may be nested in any combination, including:

•  Boolean  logic  including  negative  conditions 

•  Proximity  on  any  arbitrary  pattern 

•  Wildcards  and  "don't  cares"  anywhere  in  the  word 

•  Character  alternation 

•  Term  counting,  thresholds,  and  sets 

•  Error  tolerance  (fuzzy  matching) 

•  Term  weighting 

•  Numeric  ranges 

The  Fast  Data  Finder  was  originally  designed  and  developed  at  TRW.  In  1992,  TRW 
licensed  the  FDF  technology  to  Paracel  Inc.,  which  now  sells  a  commercial  product  called 
the  FDF-3. 

3.0      Proximity  Query  Generation 

Our first set of experiments revolved around the use of subquery proximity to rank hit
documents. We began with a simple observation: topics are often a conjunction of ideas or
concepts. For example, Topic 51, Airbus Subsidies, is a conjunction of the idea "Airbus"
(a particular aircraft manufacturer and European consortium) and the idea "subsidy" (in
particular, subsidies from the nations belonging to that consortium). Other articles about
Airbus Industrie (new planes, fly-by-wire in the A320, accident reports, financial health
reports, etc.) are not relevant to this topic; nor are articles which describe subsidies not
directed toward Airbus.

Traditional  IR  term  weighting  techniques  do  not  give  any  explicit  benefit  to  articles  which 
conjoin  ideas.  Articles  which  include  terms  relevant  to  each  of  the  component  sub-topics 
will  receive  high  scores;  but  so  will  articles  which  include  many  terms  relevant  to  only  one 
sub-topic.  Recent  efforts  implicitly  include  conjunctions  through  the  use  of  phrases  as 
terms  in  otherwise  traditional  statistical  methods. 

An alternative is the use of boolean operators. This has the desired effect (an AND of
terms forces a conjunction), but the use of booleans in IR has been viewed with some
skepticism and disfavor. Boolean operators often find a conjunction of terms where none
truly exists (for example, Airbus and subsidies might be mentioned in two separate and
unrelated portions of an article); or, if made sufficiently restrictive to eliminate spurious
matches, boolean-based searches often miss relevant articles.

We  have  followed  an  approach  which  incorporates  both  ideas.  Rather  than  focus  on 
specific  phrases,  we  search  for  terms  in  proximity  to  one  another.  The  terms  in  the  query 
are  chosen  to  represent  each  of  the  constituent  sub-topics,  just  as  in  a  boolean  search.  The 
specificity  of  the  query  is  adjusted  by  varying  the  required  proximity  of  the  terms.  Thus, 
for  Airbus  subsidies  we  might  search  for  terms  representing  "Airbus"  in  a  range  of 
proximities  to  terms  representing  "subsidies". 

This  approach  allows  conjunctions  to  be  graded.  A  small  proximity  restriction  (say,  3 
words)  yields  results  similar  to  a  keyphrase  search,  indicating  that  the  two  concepts  are 
indeed  associated  in  the  article  and  that  the  article  is  relevant  to  the  topic.  A  large  proximity 
restriction  (1  article)  is  analogous  to  a  simple  boolean  keyword  search  and  retrieves  articles 
in  which  the  concept  terms  may  be  only  loosely  associated.  Intermediate  proximities  (1 
sentence,  1  paragraph,  etc.)  indicate  intermediate  degrees  of  association  and  intermediate 
recall/precision  trade-offs. 
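
The idea of a graded conjunction can be sketched in software as follows. This is an illustration under stated assumptions, not the PSL queries that were run: the word-count windows (3, 25, 150) stand in for phrase, sentence, and paragraph proximity, and the concept term lists are hypothetical.

```python
import re


def min_gap(words, concept_a, concept_b):
    """Smallest word distance between a term of concept A and one of concept B."""
    pos_a = [i for i, w in enumerate(words) if w in concept_a]
    pos_b = [i for i, w in enumerate(words) if w in concept_b]
    if not pos_a or not pos_b:
        return None  # one concept is missing, so there is no conjunction
    return min(abs(i - j) for i in pos_a for j in pos_b)


def proximity_grade(text, concept_a, concept_b, windows=(3, 25, 150)):
    """Return 0 for the tightest window satisfied, rising as the match loosens.

    len(windows) means the concepts co-occur only somewhere in the article;
    None means the article lacks one of the concepts entirely.
    """
    words = re.findall(r"[a-z']+", text.lower())
    gap = min_gap(words, concept_a, concept_b)
    if gap is None:
        return None
    for grade, window in enumerate(windows):
        if gap <= window:
            return grade
    return len(windows)


AIRBUS = {"airbus"}
SUBSIDY = {"subsidy", "subsidies", "subsidized"}
print(proximity_grade("Governments defended Airbus subsidies.", AIRBUS, SUBSIDY))
```

Sorting hit documents by this grade puts tightly associated mentions first and whole-article co-occurrences last, which mirrors the recall/precision trade-off described above.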

It  is  also  possible  to  use  multiple  proximities  in  a  single  query  with  this  method,  or  to  use 
proximities  and  occurrence  frequencies  together,  to  form  multi-dimensional  arrays  of  query 
parameters.  For  example,  for  Topic  62,  Military  Coups  D'etat,  the  number  of  conjunctions 
was  traded  off  against  the  proximity  of  the  conjunction  to  form  a  two-dimensional  query 
set. 

For  the  initial  experiment,  lists  of  synonyms  representative  of  each  idea  in  a  topic  were 
manually  built,  and  one-  or  two-dimensional  query  sets  were  built  from  these  lists.  These 
queries  were  then  run  against  the  training  database,  and  after  some  feedback,  the  query  sets 
were  finalized.  Each  finalized  query  set  was  run  against  the  training  database  to  determine 
a  ranking  of  the  queries  based  solely  on  selectivity. 
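
One way to read this procedure is sketched below: count training hits for each query in the set, order the queries from fewest hits to most, and then rank each new document by the most selective query it satisfies. The second step is an assumption about how the ranking was applied, and the `within` helper and sample data are hypothetical.

```python
def within(a, b, window):
    """Build a query that is true when words a and b occur within `window` words."""
    def query(words):
        pos_a = [i for i, w in enumerate(words) if w == a]
        pos_b = [i for i, w in enumerate(words) if w == b]
        return any(abs(i - j) <= window for i in pos_a for j in pos_b)
    return query


def rank_queries_by_selectivity(queries, training_docs):
    """Order query names from most selective (fewest training hits) to least."""
    hits = {name: sum(1 for d in training_docs if q(d)) for name, q in queries.items()}
    return sorted(hits, key=lambda name: hits[name])


def rank_documents(queries, order, docs):
    """Rank documents by the most selective query each satisfies; drop the rest."""
    ranked = []
    for doc_id, words in docs.items():
        for level, name in enumerate(order):
            if queries[name](words):
                ranked.append((level, doc_id))
                break
    return [doc_id for _, doc_id in sorted(ranked)]


queries = {
    "article": within("airbus", "subsidies", 10**6),
    "prox10": within("airbus", "subsidies", 10),
    "prox3": within("airbus", "subsidies", 3),
}
training = [
    "airbus subsidies criticized".split(),
    "airbus sales rose while farm subsidies fell".split(),
    "airbus a b c d e f g h i j k subsidies".split(),
    "airbus delivered new jets".split(),
]
order = rank_queries_by_selectivity(queries, training)
print(order)
```

Because selectivity is measured on the training data, the ordering needs no relevance judgements, only hit counts.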

Table I shows a sample proximity query and Table III shows our TREC-II results. The
numbers of relevant documents retrieved by the proximity method queries are labeled
TRW1.
4.0      Statistical  Query  Generation 


Our second set of experiments revolved around the use of term weighting. Our basic
approach was to follow the well researched path of generating term weights proportional to
the occurrence of words in a sample of relevant text and inversely proportional to the
occurrence of words in the database as a whole. Since we were doing the routing topics,
we gathered statistics on Volume II and hoped that they would be valid for Volume III. For
sample relevant documents we used the NIST TREC-I relevant documents from Volume
II. We did not use the topic narratives or descriptions. In addition to single word terms,
we also considered two and three word phrases. We used the FDF itself to scan the training
corpus and determine the phrase frequencies of interest.

To  adapt  this  standard  approach  to  the  FDF,  we  needed  to  make  three  algorithmic 
modifications.  First,  we  needed  to  adapt  to  the  limitations  imposed  by  the  FDF  hardware. 
While  extremely  effective  for  pattern  matching,  the  FDF  is  not  a  general  purpose  computer. 
While  the  FDF  processor  cells  can  perform  basic  addition/subtraction,  the  datapath 
available  to  accumulate  an  aggregate  score  for  a  document  is  limited  to  8  or  9  bits.  Thus 
we  had  to  restrict  the  term  weights  (and  the  range  of  their  sums)  to  integer  values  between 
0 and 255 or 0 and 511. This had the effect of truncating most topics' queries at 10 to
20 terms (words, phrases, or special features). We also excluded terms from our queries
that  did  not  appear  in  at  least  30%  of  the  relevant  sample  documents. 
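
The effect of a bounded accumulator on query length can be sketched as below. The 30% coverage cutoff and the 255 ceiling come from the text; the scale factor of 100, the rounding rule, and the greedy cut-off are assumptions, and this sketch is not expected to reproduce the PslWeight values in Table II.

```python
def build_query_terms(doc_counts, db_counts, sample_size,
                      budget=255, scale=100, min_coverage=0.30):
    """Pick integer term coefficients whose sum fits in a small accumulator.

    doc_counts: relevant-sample documents containing each term.
    db_counts:  training-database documents containing each term.
    """
    candidates = []
    for term, dc in doc_counts.items():
        if dc / sample_size < min_coverage:
            continue  # too rare in the relevant sample to be kept
        candidates.append((dc / db_counts[term], term))
    candidates.sort(reverse=True)  # strongest terms first

    chosen, total = [], 0
    for weight, term in candidates:
        coeff = max(1, round(weight * scale))  # assumed integer conversion
        if total + coeff > budget:
            break  # the accumulator range is used up: truncate the query
        chosen.append((term, coeff))
        total += coeff
    return chosen


# Counts for "nra" and "gun" are from Table II; "senate" is a made-up term.
doc_counts = {"nra": 87, "gun": 126, "senate": 30}
db_counts = {"nra": 130, "gun": 3866, "senate": 500}
print(build_query_terms(doc_counts, db_counts, sample_size=150))
```

With a tighter budget the lower-weighted terms are the first to be dropped, which is how a limited score range caps the number of query terms.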

Second,  we  were  striving  to  not  give  up  the  strengths  of  the  FDF's  pattern  matching 
capabilities  to  pinpoint  special  features  in  the  text  which  have  a  large  impact  on  document 
relevance.  We  manually  reviewed  the  topics  and  prepared  special  feature  subqueries  in  an 
attempt  to  increase  the  precision  for  particular  topics.  For  Topic  59,  Weather  Related 
Fatalities,  we  manually  prepared  a  special  feature  subquery  to  detect  phrases  detailing  a 
numeric  value  of  people  killed.  We  determined  the  frequency  of  each  special  feature,  both 
in  the  sample  relevant  documents  and  in  the  training  database  as  a  whole,  and  just  added 
these  into  the  word  list  as  if  they  were  regular  single  word  terms.  In  some  instances  our 
manually  prepared  subqueries  jumped  to  the  top  of  the  list  of  statistically  relevant  terms  for 
a  topic;  in  others  they  didn't. 

Third,  we  observed  that  some  topics  had  particular  words,  phrases,  or  special  features  that 
were  present  in  almost  all  relevant  documents.  We  converted  terms  that  occurred  in  >  90% 
of  the  relevant  documents  to  boolean  ANDs  in  our  queries.  This  was  intended  to  improve 
precision  for  topics  like  62,  Military  Coups  D'etat.  The  topic  narrative  specifically  stated 
that  the  country  involved  must  be  named.  One  of  our  special  feature  subqueries  was  a  list 
of  known  foreign  country  names.  While  of  no  statistical  significance  as  a  term,  this 
subquery  did  hit  on  almost  every  sample  document.  We  thus  ANDed  it  into  the  query  as  a 
required  boolean  term. 
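
A minimal sketch of this step is given below. The 90% cutoff is from the text; the counts, weights, threshold, and the `<country>` feature name are hypothetical, and the matching function is only an illustration of combining a required term with a weighted sum.

```python
def split_required(doc_counts, sample_size, required_above=0.90):
    """Separate near-universal terms (to be ANDed) from ordinary weighted terms."""
    required, weighted = [], []
    for term, count in doc_counts.items():
        if count / sample_size > required_above:
            required.append(term)
        else:
            weighted.append(term)
    return required, weighted


def matches(doc_terms, required, weights, threshold):
    """A document must contain every required term and reach the weight threshold."""
    if not all(t in doc_terms for t in required):
        return False
    return sum(w for t, w in weights.items() if t in doc_terms) >= threshold


# Hypothetical counts from a 100-document relevant sample.
counts = {"<country>": 97, "coup": 88, "junta": 40}
required, weighted = split_required(counts, sample_size=100)
print(required, weighted)
```

A feature such as a country-name list carries little weight as a statistical term, so treating it as a hard requirement is what lets it affect precision.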

Table  II  shows  a  sample  statistical  query.  DocCount  is  the  number  of  documents  in  the 
sample  that  included  the  term,  phrase,  or  special  feature.  DbCount  is  the  number  of 
documents  in  our  training  sample  that  included  the  term.  Weight  is  DocCount  divided  by 
DbCount.  PslWeight  is  the  integer  coefficient  based  on  the  Weight.  The  relevant 
documents  retrieved  by  the  statistically  generated  queries  are  labeled  TRW2  in  Table  III. 
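
The Weight column can be checked directly from the counts. The short sketch below recomputes it for three rows of Table II; the function name is arbitrary.

```python
def weight(doc_count, db_count):
    """Weight as defined in the text: DocCount divided by DbCount."""
    return doc_count / db_count


# (term, DocCount, DbCount) taken from Table II.
rows = [("nra", 87, 130), ("assault weapons", 51, 161), ("laws", 45, 7662)]
for term, doc_count, db_count in rows:
    print(f"{term:16s} {weight(doc_count, db_count):.8f}")
```

The last row also shows the coverage cutoff at work: 45 of the 150 sample documents is exactly 30%, the minimum for a term to be kept.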
TABLE I - Sample Proximity Query
Query Template for Topic 75, Automation

/* Concept 1: automation */

define x_auto   'automat[e|ion|ed|es|ing]' end

/* Concept 2: change in economics */

define x_up_dn  '[increas|decreas][e|es|ed|ing]' |
                'rais[e|es|ed|ing]' |
                'ris[e|es|ing]' |
                'drop[|s|ped]' |
                'dip[|s|ping]' |
                'expan[d|ds|ded|sion[|s]]' |
                'lower[|s|ed|ing]' |
                'fall[s|ing]' |
                'hike[|s|d]' |
                'cut[|s|ting]' |
                'reduc[e|es|ed|tion[|s]]' end

define x_econ   'cost[|s]' | 'payroll*'
                | { 3 words -> 'work' and 'force' } end

/*
 * Query template used for Topic 75
 */

{ (prox1) ->
    @x_auto and
    { (prox2) -> @x_up_dn and @x_econ } }



TABLE II - Sample Statistical Query
Term weighting table for Topic 93, NRA Political Backing
Doc sample size is 150

Term               DocCount   DbCount   Weight       PslWeight

nra                      87       130   0.66923077          62
national rifle a        145       255   0.56862745          53
assault weapons          51       161   0.31677019          30
semiautomatic            68       332   0.20481928          19
gun_control              73       387   0.18863049          18
handguns                 51       319   0.15987461          15
rifle                   146      1336   0.10928144          10
handgun                  52       550   0.09454545           9
firearms                 63       811   0.07768187           7
rifles                   46      1318   0.03490137           3
gun                     126      3866   0.03259183           3
guns                     96      3020   0.03178808           3
law enforcement          51      2509   0.02032682           2
assault                  67      3491   0.01919221           2
ban                      91      5510   0.01651543           2
enforcement              54      4825   0.01119171           1
weapons                  99      8993   0.01100856           1
association             148     16780   0.00882002           1
legislation              64      7984   0.00801603           1
laws                     45      7662   0.00587314           1
TABLE III - TRW/Paracel TREC-II Routing Results

           NIST    Relevant Retr @100   TRW1   TRW2
 Qry #     Rel.   Best  Median  Worst    Rel    Rel   Description

    51       11     11      10      0     10     10   Airbus Subsidies
    52      454    100      97     31     99     99   S. Africa Sanctions
    53      154     84      50      1     52     55   Leveraged Buyouts
    54      124     65      57      0     21     47   Satellite Launch Contracts
    55      320    100      87      0     54     96   Insider Trading
    56      395     96      88      3     89     96   Prime Rate Moves
    57      319     91      73      0     61     91   MCI
    58       76     62      39      4     49     28   Rail Strikes
    59      574     97      80      1     95     95   Weather Fatalities
    60       18      9       4      0      2      3   Merit-Pay vs. Seniority
    61       67     60      24      2      7     24   Israel & Iran-Contra
    62      426     92      75      9     82     81   Military Coups D'etat
    63       74     40      24      0     24     29   Machine Translation
    64      282     68      53      1     29     57   Hostage-Taking
    65      214     55      29      0      2     46   Info Retrieval Systems
    66       86     40      19      0     20     23   NLP
    67      365     64      52      0     10     61   Civil Disturbances
    68       76     61      38      0     58     55   Health Hazards
    69        1      1       0      0      0      0   Reviving SALT II Treaty
    70       34     32      28      0     31     32   Surrogate Motherhood
    71      300     73      35      1     21     32   Border Incursions
    72       91     43      31      0     17     30   Demographic Shifts/U.S.
    73      355     74      49      0     23     74   Demographic Shifts/World
    74      323     60      32      5      6     35   Conflicting Policies
    75      372     59      28      1     59     31   Automation
    76      163     34      24      0     11     34   Original Intent
    77       85     49      36      0     36     40   Poaching
    78       83     66      58      0     40     57   Greenpeace
    79      341     85      75      0     81     77   FRG Party Positions
    80      143     54       8      0     54      8   Candidate Platforms
    81        4      4       2      0      1      2   PTL Fallout
    82       --     89      65      0     30     87   Genetic Engineering
    83      235     64      31      2     54     25   Protect the Atmosphere
    84      101     27      16      0     22     14   Alternative Energy
    85      670     88      69     38     59     84   Official Corruption
    86       40     27      16      0     19     12   Bank Failures
    87      151     75      42      2     42     36   S&L Prosecutions
    88       32     29      21      0     11     26   Crude Oil Price Trends
    89       17      8       2      0      3      7   OPEC Investments
    90       75     39      10      0     39      9   Oil & Gas Reserves
    91        9      6       3      0      5      5   Acq of Advanced Weapons
    92       27     17       6      0      7      9   Military Equip Sales
    93       94     65      61      9     61     62   NRA
    94      300     76      47      0     57     62   Computer Crime
    95      359     77      45      1     63     40   Computer Crime Detection
    96      310     86      42      2     23     49   Computer Medical Diag
    97      319     59      21      0     20     41   Fiber Optics Appl
    98      722     96      67      0     83     84   Fiber Optics Manuf
    99      291     99      92     10     91     89   Iran-Contra
   100      204     94      84      0     52     89   Controlling High Tech
5.0      Analysis  of  Results

The results from our two TREC runs (Table III) are summarized below. The proximity
queries (TRW1) scored at or above the median on 28 topics (including three topics which
achieved  the  best  score)  and  below  median  for  22  topics.  While  not  bad,  our  proximity 
queries  did  not  perform  as  well  as  we'd  hoped.  Where  the  proximity  queries  did  poorly, 
we  attribute  this  primarily  to  poor  term  selection.  One  such  case  was  Topic  65,  Information 
Retrieval  Systems.  We  made  two  errors:  first,  due  to  an  oversight,  one  of  the  manually 
entered  query  terms  was  overly  broad;  second,  the  query  author  considered  database 
systems  to  be  "information  retrieval  systems".  We  feel  that  this  is  a  fault  of  the  query 
formulation,  not  of  the  assessments.  If  we  had  checked  the  training  assessments,  we  would 
have  eliminated  the  terms  from  our  query.  Any  system  which  relies  solely  on  the  topic 
statement  will  run  afoul  of  this  problem.  Systems  which  make  use  of  user-supplied 
relevance  information  will  achieve  better  performance.  Our  results  again  demonstrate  the 
inherent  problems  with  basically  boolean  query  formulations.  Our  efforts  at  using 
proximity  to  soften  the  boolean  were  not  sufficient  to  overcome  this  weakness. 

Run    High   Above Med   Median   Below Med   Low   11-pt avg

TRW1      3          19        6          22     0      0.2525

TRW2      5          27        5          13     0      0.3459
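
The counts in this summary can be recomputed from Table III by comparing each run's relevant-retrieved figure with the best, median, and worst figures for the topic. The Python sketch below assumes the five categories are mutually exclusive and tested in the order shown; the function name `classify` and the choice of four sample rows (topics 51, 65, 75, 80) are illustrative only.

```python
from collections import Counter

def classify(score: int, best: int, median: int, worst: int) -> str:
    """Place one topic result into a summary-table category (assumed ordering)."""
    if score == best:
        return "High"
    if score > median:
        return "Above Med"
    if score == median:
        return "Median"
    if score == worst:
        return "Low"
    return "Below Med"

# (topic, best, median, worst, TRW1, TRW2) copied from Table III
ROWS = [
    (51, 11, 10, 0, 10, 10),
    (65, 55, 29, 0, 2, 46),
    (75, 59, 28, 1, 59, 31),
    (80, 54, 8, 0, 54, 8),
]

trw1 = Counter(classify(r[4], r[1], r[2], r[3]) for r in ROWS)
trw2 = Counter(classify(r[5], r[1], r[2], r[3]) for r in ROWS)
print("TRW1:", dict(trw1))
print("TRW2:", dict(trw2))
```

Topic 65 shows the contrast discussed above: the proximity query lands below the median, while the statistical query for the same topic lands above it.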

The  statistical  queries  (TRW2)  did  much  better,  scoring  at  or  above  the  median  on  37  of  the 
topics. The 11-pt average and total relevant documents retrieved figures were excellent and
are close to the best academic groups. The adaptations to run the statistical queries on the
FDF hardware evidently did not hurt performance. We again observed that the dominant
factor  in  achieving  good  performance  is  proper  term  selection.  The  details  of  the  term 
weight  calculations  didn't  seem  to  make  much  difference  except  to  influence  which  terms 
were  selected.  We  tried  a  number  of  different  schemes  for  generating  term  weights 
including  using  various  statistical  parameters,  log  weighted  coefficients,  and  converting 
terms  present  in  all  sample  documents  to  boolean  ANDs  in  the  query.  For  a  given  set  of 
terms,  we  did  not  find  much  difference  in  performance  between  these  schemes.  The 
scheme  used  for  TRW2  was  one  of  the  simpler  ones  we  tried. 

We  were  also  interested  in  evaluating  the  use  of  phrases  and  special  features  as  additional 
terms  in  our  statistical  queries.  They  seemed  to  help,  but  not  dramatically.  This  was  a 
disappointment. Looking at the results topic by topic, however, we observed a lot of
variation.  For  some  topics,  the  addition  of  a  key  phrase  or  special  feature  helped  a  great 
deal.  This  indicates  that  use  of  phrases  and  special  features  has  promise  for  improving 
performance,  but  that  we  just  have  not  learned  how  and  when  to  employ  them.  For 
example,  our  term  weighting  scheme  this  year  didn't  account  for  term  interdependence. 
Particularly  when  we  start  mixing  single  word  terms  with  phrases  and  special  features  that 
contain those same terms, it would seem the algorithm could be improved by explicitly
accounting  for  this  redundancy. 
6.0      Conclusions  and  Future  Plans


Overall, our results for TREC-2 are encouraging. We were successful in adapting the FDF
hardware to running a soft statistical information retrieval algorithm. Not only did the FDF
run  all  the  final  searches  with  no  significant  post-processing,  we  also  found  it  a  useful  tool 
for  experimenting  with  different  proximity  windows  and  deriving  database  frequencies  for 
phrases  and  special  features.  We  plan  to  continue  our  work  and  to  participate  in  TREC-III 
next  year.  The  lessons  we  learned  from  TREC-I  have  proven  very  valuable,  and  so  too 
should  the  lessons  of  TREC-II.  For  TREC-III,  we  will  continue  to  examine  the  range  of 
available  query  formulation,  execution  and  evaluation  models,  with  an  eye  to  adapting 
those  models  for  use  with  the  FDF  search  hardware.  We  are  looking  at  methods  to  use  the 
raw  horsepower  of  the  FDF  to  expand  existing  techniques  in  ways  which  had  previously 
been  considered  too  computationally  expensive. 

We  are  also  struck  by  the  observation  that  different  query  generation  techniques  are 
effective  on  different  topics.  A  system  that  could  execute  a  range  of  methods  might  be  able 
to  match  the  query  generation  approach  to  the  topic  type.  We  plan  to  consider  possible 
schemes  for  characterizing  the  topic  in  advance  so  that  the  system  might  be  able  to  use  the 
best  method. 
Knowledge-Based  Searching  with  TOPIC®

John  W.  Lehman,  Clifford  A.  Reid,  et  al. 

Verity,  Inc. 
1550  Plymouth  Street, 
Mountain  View,  CA  94043 
(415)  960-7620  /  jlehman@verity.com 


1. OBJECTIVE OF VERITY'S TREC-2
EXPERIMENTS

Verity,  Inc.  is  the  first  major  commercial  product 
participant  in  TREC.  Verity's  product  is  TOPIC®. 

Verity  participated  in  TREC-2  as  a  Category  A  Site. 
This  participation  was  Verity's  first  TREC,  and  we 
encountered  many  of  the  logistical  problems  of  other 
sites  in  their  TREC-1  experience. 

Topic's  search  users  wish  to  understand  the  search  result 
quality  to  expect  in  their  personal  searches  on  their 
(large)  collections.  Verity  also  expects  to  obtain  insights 
for  future  product  improvements. 

Topic  is  a  mature  commercial-off-the-shelf  manual  text 
search  program  combining  the  results  of  human 
expertise  with  a  powerful  search  expression  language  and 
fast  search  algorithms.  Topic's  installations  use 
manually  or  semi-automatically  developed  libraries  of 
searches (topics), which are instances of the search
expression  language  and  which  are  supplied  to  all  users. 

Verity  begins  its  TREC  experiments  with  a  gathering  of 
"ground  truth"  regarding  unaided  adhoc  end  user  search 
result  quality.  Future  experiments  will  incorporate 
predefined searches (topics) and other Topic search aids to
determine  their  level  of  improvement/impact  on  search 
result  quality. 

2. TOPIC SEARCH APPROACH

The  Topic  philosophy:  Domain  knowledge,  both 
descriptive  and  content-based,  using  constructs 
specifically designed to discriminate between full text
material,  is  the  only  way  to  consistently  obtain  high 
recall/precision  on  large  heterogeneous  collections. 
Search  result  quality  may  be  enhanced  by  the 
employment  of  collection-specific  statistics  to  locate 
additional  domain-relevant  terminology.  Searches  are 
repeated  and  subject-matter  expertise  is  a  scarce  resource. 
The problem that Topic addresses is the effective use of a
human's time in analyzing search results to locate the
preponderance of relevant details in the fewest possible
documents, and therefore the smallest possible elapsed
time.


2.1 TOPIC KNOWLEDGE
REPRESENTATION

The  Topic  product  employs  several  approaches  to 
individual  term  search,  organized  by  a  rule-based,  or 
concept-based,  approach  to  search  term  aggregation.  In 
Topic, the search focus is the topic (concept, notion,
idea, or subject), and the topic is the user-specified
"smart" description of all of the evidence "about" or "of"
the topic as it (the evidence) would be found in text
documents.

2.1.1 TOPIC INDICES

The Topic product line catalogs and indexes both fielded
(structured)  data,  and  full-text.  Topic  automatically 
extracts  structured  data  (such  as  title,  author,  etc.)  into 
searchable  fields,  using  a  lexical  analyzer.  Fielded  data  is 
searchable  separately  or  in  combination  with  full-text. 

Indexes  on  the  full-text  are  (for  all  non-stopped  characters 
and  strings): 

-word/string
-stemmed word (morphological variant)
-soundex (phonetic spelling variety)
-statistically correlated terms (called the
suggestion index)
-typographical error index
-thesaurus
-wildcard (universal character/group expansion)

An  index  on  all  values  (choices)  for  fielded  data  is  also 
produced. 

2.1.2  TOPIC  SEARCH  RULES 

Search  rules  consist  of  relational  comparisons  to  field 
values,  exact  or  fuzzy  matches  on  full-text  search  terms, 
aggregated  by  boolean  and  evidential  reasoning  operators
and  point  value  uncertainty  at  the  term  level  (each  piece 
of  evidence  has  a  strength/uncertainty  attached  to  its 
predictability  of  its  parent  concept). 

Topic  provides  search  rule  management  functions  to 
support  the  creation,  repeated  use,  modification,  sharing 
and  display  of  one  or  more  libraries  of  related  search 
rules.  The  search  rule  libraries  are  themselves  searchable, 
including  text  annotations  of  the  rules.  Search  rules  are 
interactive  queries,  automatic  queries,  and  a  training 
mechanism for the installation's domains.

A  search  rule  definition  may  include  several  thousand 
pieces  of  evidence  in  over  one  hundred  levels  of  detail. 
One  search  rule  library  may  contain  twenty  thousand 
rules.  Search  rules  (topics)  are  named,  and  a  reference  to 
the  name  in  a  search  expression  inherits  all  lower  levels 
of  evidence.  Any  query  which  includes  a  search  rule 
name  will  automatically  receive  the  full  definition  of  the 
rule  in  the  search.  The  lowest  level  of  evidence  is  the 
text  expression.  Search  rules  may  be  composed  of  other 
named  search  rules. 
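
The way a rule name inherits all lower levels of evidence can be pictured as a recursive expansion over a library of named rules. The Python sketch below is an illustration only and does not use Topic's rule syntax: the library contents, the rule names, and the `expand` function are all hypothetical.

```python
# Hypothetical rule library: each named rule lists child rule names or text terms.
RULE_LIBRARY = {
    "gun-control": ["firearms", "legislation-terms"],
    "firearms": ["handgun", "rifle", "assault weapon"],
    "legislation-terms": ["ban", "law"],
}

def expand(name, library, _path=None):
    """Return the text expressions reachable from a rule name, depth first."""
    path = set() if _path is None else _path
    if name not in library:
        return [name]              # lowest level of evidence: a text expression
    if name in path:
        raise ValueError(f"rule {name!r} refers to itself")
    path.add(name)
    terms = []
    for child in library[name]:
        terms.extend(expand(child, library, path))
    path.discard(name)             # leave the current branch
    return terms

print(expand("gun-control", RULE_LIBRARY))
```

A query that mentions only "gun-control" therefore searches on every text expression beneath it, which is the behaviour described above for named rules.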

Search  rules  appear  as  an  alphabetical  list  of  topic 
names,  an  indented  outline  showing  the  levels  of  rules, 
or  a  graphical  "family  tree"  display  of  rules  and  their 
parents/children,  including  evidence  combination 
operators  and  evidence  "weights".  Searches  may  be 
executed  directly  from  any  node  (name)  in  the  search  rule 
family.  A  topic  search  rule  graphic  display  example 
appears  in  Figure  1. 

The  search  rule  syntax  consists  of  an  exact  or  fuzzy 
match  (pattern  match)  capability  for  individual  terms 
(case  sensitive);  a  boolean  combination  (and  (all),  or 
(any),  not),  of  terms;  dual  direction,  nested,  grammatical 
(paragraph,  sentence,  phrase)  proximity  operators;  a 
relative  (fuzzy)  proximity  operator  for  two  or  more 
terms,  an  evidence  aggregation  operator  (accrue)  for  both 
full-text  and  structured  field  data,  and  inexact  match 
techniques  as  follows: 

1.  wildcard  expressions  for  term  expansion;  single 
character,  character  group,  or  character  class 

2.  soundex (first letter common) expressions for
phonetic term expansion

3.  source  language-specific  stemming  (morphological 
variants)  expressions  for  term  expansion 

4.  typographical  expressions  for  term  expansion  (n- 
character  infidelity  to  search  term) 

5.  multi-direction  thesaurus  (user-modifiable)  for  term 
expansion 

6.  suggestion  (statistical  correlation)  for  term 
expansion 


7.  evidence  appearing  in  a  field  value,  or  as  the  field 
value  (contains,  matches,  substring,  starts,  ends). 

Each  of  the  above  inexact  match  techniques  may  be 
executed  automatically.  Negative  evidence  may  be 
applied on a term-by-term basis with any operator. The
structured field data types are character, number and date.
Date arithmetic is provided, as well as relative date
expressions such as "yesterday", "today", etc.
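
Of the inexact match techniques listed, soundex is a published algorithm and is easy to illustrate. The following Python sketch implements the standard American Soundex code, which keeps the first letter and encodes the remaining consonants as digits. Whether Topic's soundex index uses exactly this variant is not stated in the text, so treat it as an assumption.

```python
# Standard American Soundex (sketch). Vowels and Y carry no code; H and W are
# skipped without separating letters that share a code.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit

def soundex(word: str) -> str:
    """Return the four-character Soundex code of a word ('' for no letters)."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue                      # H and W do not break a run of equal codes
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code                       # a vowel resets prev to ""
    return (result + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))   # R163 R163
```

Two spellings that share a code, as "Robert" and "Rupert" do here, would be treated as matches by a soundex expansion.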

2.1.3   SEARCH  RESULT  RANKING 

Results  of  searches  are  relevance  ranked  lists  of 
documents,  with  displayed  titles  or  other  descriptive 
information.  The  numeric  score,  and  the  accompanying 
rank,  are  the  result  of  a  best  fit  comparison  of  the  full- 
text  document  and  descriptor  content  and  the  search  rule 
evidence.  The  ranking  is  subject  to  an  optional 
threshold,  used  primarily  to  limit  output,  but  the 
threshold  may  be  used  to  describe  search  recall  and 
precision.  The  relevance  threshold  is  always  used  in 
dissemination/notification. 

Evidence  consists  of  terms,  operators  (syntax)  and  the 
numeric  strength  of  the  relationship  between  the 
evidence  and  its  (next  higher  level)  search  rule.  The 
evidence  may  be  aggregated  or  evaluated  with  boolean 
operators.  Aggregation  involves  giving  relevance  score 
credit  for  each  piece  of  evidence  found  (breadth  of 
evidence  first).  As  each  level  is  evaluated  in  a  search  rule 
(tree),  potential  document  score  modification  occurs 
(since  successive  levels  may  be  weighted  evidence  for 
their  next  broader  concept).  The  scoring  of  an  individual 
term  may  include  a  frequency-of-occurrence  factor  (a 
normalized concentration factor), a less powerful scoring
factor  than  the  absolute  presence  of  the  evidence  in  the 
document.  A  document  score  explanation  function  is 
included. 


2.2 AGGREGATE SEARCH
FUNCTIONS

Searches  may  iterate  on  the  results  of  the  previous 
search. 

Any  search  may  be  named/saved  along  with  its  results 
manipulation  criteria  (sorting  by  fields,  grouping)  for 
later  execution.  Any  search  criteria  may  be  interactively 
defined  as  a  logical  view  of  the  collection,  which  then 
provides  many  alternative  search  universes  for  the  user 
population.  All  Topic  activities  are  audited.  A  search 
which  supports  discretionary  access  control  may  be 
transparently appended to any user's search.
Figure  1
TOPIC  Search  Rule  and  Result 
2.3  USER  INTERFACE  TO  SEARCH

Every  search  is  automatically  configured  into  a  rule.  The 
simplest  search  is  a  list  of  terms,  which  may  be  entered 
at  the  keyboard,  selected  from  displayed  document(s) 
content,  or  selected  from  lists  of  terms.  This  list  is 
automatically  enhanced  by  term  expansion,  expansion  to 
existing  named  rules  whenever  the  rule  name  appears  in 
the  search  expression,  and  evidence  aggregation.  Searches 
involving  structured  fields  are  generally  addressed  by  a 
form  interface,  which  aggregates  field  and  full-text 
content.  Any  list  of  terms,  rule-names,  or  extensions 
such  as  thesaurus/soundex  may  be  used  to  initiate  a 
search  or  add  to  a  search  expression. 

2.4  SEARCH  RESULT  ANALYSIS  AIDS 

The  Topic  philosophy  of  minimizing  the  elapsed  time  to 
obtain  the  necessary  relevant  details  that  constitute  an 
answer  or  support  a  decision  necessitates  analysis  aids 
beyond  the  search  composition  and  result  list  display. 

The  Topic  result  list  may  be  browsed  (page,  result 
number  etc.).  A  document  selected  for  display  produces 
the  full  text  display  with  all  search  evidence  highlighted 
(e.g.  in  reverse  video  or  color).  The  display  may  be  the 
native  form  of  the  document,  which  for  most  of  today's 
collections means a marked-up format with useful user
guidance  in  the  markup  itself  (e.g.  sections,  paragraph 
headings  etc.).  The  user  may  choose  to  browse  or  to 
move  directly  to  the  first/next/previous  occurrence  of  a 
search  term  in  the  document.  Similarly,  the  user  may 
move  through  the  document  using  various  document 
enhancements such as hypertext links, and may follow
hypertext links to other documents, including graphics
and other media. Previously generated annotations are
available for browsing. Queries or other applications
may be linked to document content. A specific search
term (not necessarily part of the original search) may
be used as a browsing aid to the document.

2.5  SECURITY 

Users  may  be  prevented  from  accessing  information  via 
operating  system  permissions,  and  built-in  access 
controls,  including  discretionary.  The  product  processes 
have  been  certified  at  system  high  in  many  installations, 
and  some  sponsors  have  applied  for  MLS  certifications 
based  upon  the  delivered  product. 

2.6 DATA ARCHITECTURE/
PERFORMANCE/CONFIGURATION

Topic  enables  the  logical  division  of  a  collection  of 
documents  into  "partitions",  which  are  document 
descriptions  and  indexing  data  about  the  arbitrary/ 
intentional subset. Partition size, purpose and
characteristics are under the application administrator's
control. The raw documents are not "owned" by the
Topic application. Topic will produce indices which are
approximately 70% of the native text size
(the TREC-2 index size was approximately 50%). This
includes fielded, word, and subject (rule evidence) level
indices.

The partition data is platform-independent (i.e. the
documents and their associated partitions may be
moved/accessed from any Topic platform).

Searches  may  be  performed  on  the  served  desktop,  on  a 
host  or  both. 

Normal  performance  on  a  personal  computer  is  in  the 
thousands  of  document-rule  nodes  per  second,  up  to 
many  tens  of  thousands  of  nodes  per  second  on  current 
workstations.  The  search  rule  low  level  evidence  is 
contained  in  a  size/speed-optimized  index  (topics),  which 
is  essential  to  rapid  response  on  complex  rules.  This 
index  is  automatically  modified  each  time  topic  evidence 
is  added,  so  the  word  positional  information  is  searched 
only  on  the  first  use  of  the  term.  The  topics  index 
normalizes  document  size  so  that  all  search  response 
times  are  predictable.  Partitions  enable  incremental 
(ranked)  results,  guaranteeing  few-second  time-to-first- 
result,  regardless  of  the  size  of  the  collection.  The 
response  characteristic  which  Topic  optimizes  is  the 
time-to-first-meaningful-result.  The  rule  evidence  index 
may  be  centralized  or  distributed,  and  when  distributed,  it 
provides  the  ability  to  produce  a  ranked  results  list  with 
a  minimum  of  network  access. 

Integration  with  third  party  components  is  available 
from  the  end  user  interface,  or  shared  libraries.  The 
program  provides  logical  links  between  document-image, 
document-document,  document-annotation,  document- 
search  request.  Some  links  may  be  automatically 
determined  at  indexing  time  (image,  cross-reference). 

The  structured  field  values  may  be  entered  interactively, 
or  filled  automatically  from  a  lexical  analyzer.  The 
program provides an end-user process interface between
scanning,  OCR/ICR  and  indexing. 

3. THE TREC EXPERIMENTS

3.1 DATA PREPARATION

The  TREC-2  texts  data  preparation  processing  was 
performed  on  a  Sun  SPARC  10  (UNIX  4.1.3). 
Cataloguing  and  indexing  was  performed  at  the  rate  of 
approximately  100  Mbytes  per  hour.  This  process 
included  the  automatic  extraction  of  10  fields  from  the 
ASCII content. Partitions were set at 8000 documents
for  all  data.  There  were  no  processing  errors. 
No  markup  language  (SGML)  interpreter  was  used
during  data  preparation,  and  the  optional  alphabetical 
word  list  (used  only  for  display)  and  typographical  error 
index  (used  almost  exclusively  for  OCR'd  data)  were  not 
employed. Special indices such as correlated terms, and
paragraph/sentence  positioning  were  not  produced.  As 
the  fuzzy  proximity  operator  was  used  in  the  tests,  only 
a  word  position  index  was  produced.  No  document  was 
divided  into  logical  or  arbitrary  sections  for  processing  or 
search  result  enhancement,  although  that  approach  is 
used  in  virtually  all  non-newswire  Verity  installations. 
The  purpose  of  logical  division  (a  forerunner  of  the 
intelligence  available  in  a  standard  markup  language)  is 
to  create  domain-specific  logical  documents,  and 
therefore  to  reduce  the  impact  of  larger,  multi-subject 
documents  on  results  (they  would  appear  in  search 
results  simply  because  of  their  breadth  of  words). 

3.2  TOPIC  CONSTRUCTION 

Verity  personnel  manually  constructed  the  search  rules 
from  the  subject  area  descriptions  and  the  training  data. 
No  rule  developer  was  identified  or  chosen  as  a  subject 
matter  expert,  and  for  certain  of  the  contributors,  this 
was  their  initial  interface  with  using  Topic.  [Search  rule 
libraries  are  created  by  approximately  6%  of  Topic's  user 
population  and  the  remainder  of  Topic's  users  employ 
the  topics  developed  by  others].  On  the  average,  the 
TREC-2  volunteers  were  considered  novices  on  the 
Topic  product,  particularly  the  search  rule  development 
area.  Volunteers  were  not  encouraged  to  use  specific 
features  of  the  product,  and  in  at  least  one  case, 
inadequate  communication  produced  potentially 
inaccurate  search  expectations.  As  search  rules  were 
interactively  developed,  the  rule  evidence  was 
automatically  indexed  for  repeated  use  of  the  rule.  The 
twenty  volunteers  each  produced  between  3  and  8 
retrospective  and  routing  queries.  The  range  in  time 
spent  on  individual  query  development,  and  result 
production  was  from  fifteen  minutes  to  eight  hours,  over 
a  several  week  period.  The  average  time  to  produce  the 
TREC-2  result,  obtained  from  interviewing  the 
volunteers,  was  approximately  one  hour. 

3.3 EXPERIMENT PERFORMANCE

Typical  response  time  performance  on  the  searches  was 
two  seconds  per  8000-document  partition,  or 
approximately  two  minutes  to  search  the  entire 
collection.  A  single  term,  indexed  as  rule  evidence,  was 
used  to  search  the  entire  collection,  and  the  1.1  million 
document  collection  was  searched  in  21  seconds. 


3.4 ANALYSIS OF OFFICIAL RESULTS

The  post  hoc  analysis  of  Topic's  TREC-2  results 
generally  found  that  the  Topic  system  performed  well. 
When  compared  with  other  manual  systems,  the  scores 
are amongst the best. In the few cases where Topic
appeared  to  fail,  we  have  generally  been  able  to  identify 
easily  correctable  deficiencies,  that,  had  they  been  noticed 
during  the  experiment  proper,  would  have  resulted  in 
superior  performance  by  Topic  in  TREC-2. 

Based  on  our  analysis,  we  believe  that  the  prospects  for 
TREC-3  look  very  bright. 

Our  analysis  of  selected  results  from  our  TREC-2 
submissions  focuses  mainly  on  the  "failure  cases"  since 
these  are  most  likely  to  give  us  insights  in  how  to 
improve Topic's (and users') performance in future TREC
experiments.  This  also  allows  us  to  investigate  whether 
there  are  any  fundamental  issues  with  using  Topic  to 
model  the  information  need  statements  used  in  TREC. 

We  analyzed  two  routing  and  three  ad-hoc  topics  in 
detail.  Our  summary  follows. 

The  following  general  observations  applied  to  all 
searches: 

-Adhoc  searches  were  submitted  against  all  three  disks, 
which  produced  poorer  quality  results  generally,  as 
documents from disk three appeared in some search
results.^ 

-Field  value  evidence  was  not  used,  and  in  some 
domains/subject  areas,  domain  knowledge  about  the 
sources  of  information  would  favor  (rank  higher)  sources 
with  the  appropriate  use  of  terminology,  (e.g.  business 
sources  about  financial  performance,  or  foreign  datelines 
have  higher  likelihood  of  describing  foreign  prominent 
persons/activity, as in topics 66 or 121).

-The queries which attempted to use nomenclature
with  hyphens  (e.g.  M-1)  failed  to  return  an  exact  match 
as  the  hyphen  was  not  included  as  an  indexed  character. 

-The fuzzy proximity (near) operator was undocumented,
only  one  volunteer  used  it  and  other  users  expected 
sentence  /  paragraph  proximity  in  their  searches.  The 
index  did  not  contain  sentence  /  paragraph  positional 
data,  and  all  uses  of  sentence  or  paragraph  operators 
produced  erroneous  results  because  the  search  arbitrarily 
assigned  sentence  and  paragraph  boundaries. 
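
The hyphen problem noted above follows directly from the choice of indexed characters. The Python sketch below uses a deliberately simple tokenizer of our own (not Topic's lexical analyzer; the function name and the sample sentence are hypothetical) to show that a term such as "M-1" cannot be matched as a word when the hyphen is treated as a separator at index time.

```python
import re

def tokenize(text, word_chars="A-Za-z0-9"):
    """Split text into index terms; any character outside word_chars separates terms."""
    return re.findall(f"[{word_chars}]+", text)

doc = "The Army plans to buy more M-1 tanks."

without_hyphen = tokenize(doc)                         # hyphen acts as white space
with_hyphen = tokenize(doc, word_chars="A-Za-z0-9-")   # hyphen kept as a word character

print(without_hyphen)
print("M-1" in without_hyphen)   # False: only "M" and "1" were indexed
print("M-1" in with_hyphen)      # True
```

Under the first tokenization an exact-match query for "M-1" finds nothing, while a phrase or proximity query on "M" followed by "1" could still succeed.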


For  routing  queries,  the  score  threshold  was  set  to  zero; 
any  document  containing  evidence  entered  the  routing 
result  list. 


^Reprocessing the adhoc searches against only disks 1 and
2 produced a numeric result improvement of 0-70
percent, with a few changes from under the median to over
the median.
3.4.1   ROUTING  TOPICS

Overall,  Topic's  performance  on  the  routing  topics  was 
rather  good.  We  count  that  21  of  the  50  results  were  at 
or above median, and three were actually the best score.
Most  of  the  other  results  were  on  the  low  side  of  the 
median.  The  relevant  comparison  to  the  median  is 
summarized  in  Figure  2.  The  exceptions  were  topics  66, 
67,  69,  74,  90  and  91,  for  which  the  Topic  search  used 
could  be  said  to  have  failed.  Several  of  these  were 
straightforwardly  explained.  For  example,  in  the  case  of 
topic  67  the  wrong  results  were  submitted.  Our 
independent  scoring  of  the  correct  results  set  would  give 
the Topic search a below-median score. For topic 69 there
was  in  fact  only  one  relevant  document,  but,  at  least  in 
our  reading  of  the  definition,  this  seems  to  be  a  false 
positive.  In  the  case  of  topics  90  and  91  the  Topic 
search  definitions  were,  in  our  opinion,  over-constrained. 
Further,  in  the  case  of  topic  91  an  index  creation 
decision  prevented  a  quite  reasonable  Topic  definition 
from  performing  as  well  as  it  could.^  The  other  two 
topics  are  of  more  interest. 

No clear pattern emerged across the types of search,
although in the routing augmentation category the
Topic performance was well above the median on 20 of
33 searches.

3.4.1.1  ROUTING  TOPIC  66 

A  relevant  document  for  this  topic  is  one  that  identifies  a 
type  of  natural  language  processing  technology  that  is 
being  developed  or  marketed  in  the  United  States.  The 
original  definition  of  the  Topic  is  basically  a 
conjunction  (AND)  of  a  natural  language  concept  and  a 
products/technology  concept. 

Performance  was  very  poor,  viz: 
Relevant  =  86 
Rel_ret = 1
R-Precision  =  0.0000 

Inspection  of  the  Topic  revealed  that  one  of  the 
conjuncts  (the  products/technology  concept)  had  a  weight 
of  0.05  -  thus  effectively  limiting  the  range  of  scores 
that  Topic  could  produce  to  be  in  an  extremely  narrow 
range. 


"^This  topic  is  about  the  acquisition  of  advanced  weapons 
by  the  U.S.  Army.  One  of  the  weapons  systems  mentioned 
in the information need statement is the M-1 tank. This was
included in the Topic definition as the word "M-1"; but
since the "-" symbol was interpreted as white space at
database build time, there was no possibility of retrieving
documents based on "M-1" as a word.


We  changed  the  0.05  to  0.5  and  produced  the  following: 

Relevant  =  86 

Rel_ret  =  44 

R-Precision  =  0.2442 
which  is  a  median  result. 

We  concluded  that  for  Topics  to  be  effective  we  need  to 
ensure  a  sufficient  range  of  scores  to  give  us  the 
discrimination  needed  for  the  TREC  scoring  algorithm. 
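
One way to see why a 0.05 weight compresses the score range: suppose a conjunction is scored as the minimum of its weighted children. That scoring model is an assumption made here for illustration (the text does not give Topic's exact AND formula), and the function names are ours.

```python
def and_score(children):
    """Score an AND node as min(weight * child_score) over (weight, score) pairs.

    This is an assumed model, not Topic's documented formula.
    """
    return min(weight * score for weight, score in children)

def max_and_score(weights):
    """Highest score the AND node can reach when every child matches fully (1.0)."""
    return and_score([(w, 1.0) for w in weights])

print(max_and_score([1.0, 0.05]))  # ceiling with the original 0.05 weight
print(max_and_score([1.0, 0.5]))   # ceiling after raising the weight to 0.5
```

Under this model every document scores between 0 and 0.05 with the original weight, leaving little room to separate strong matches from weak ones; raising the weight to 0.5 widens the range tenfold.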

3.4.1.2  ROUTING  TOPIC  74 

A  relevant  document  for  this  topic  is  one  that  cites  an 
instance  in  which  the  U.S.  Government  propounds  two 
conflicting  or  opposing  policies.  The  routing  task  is 
complicated  because  this  conflict  may  not  necessarily  be 
mentioned  in  the  same  document. 

In  our  opinion,  this  is  a  case  where  no  amount  of 
sophistication  in  Topic  construction  would  enable  Topic 
to do very well. The information need is simply outside
the  scope  of  a  retrieval  system  that  uses  non-NLP 
techniques.  The  best  one  could  hope  for  is  to  model  a 
document  that  talks  about  the  meta-idea  of  conflict  (i.e., 
find  documents  that  talk  about  the  US  having  conflicting 
policies,  rather  than  documents  that  reference  the  specific 
conflicting  policy).  This  is,  in  fact,  what  was  done  in 
the  original  submission.  The  results  were: 

Relevant  =  323 

Rel_ret = 18

R-Precision  =  0.0464 
which  is,  of  course,  rather  poor. 

The  original  statement  of  need  actually  mentions  three 
examples  of  conflicting  policies  so,  as  an  experiment, 
we  ran  the  following  query: 

*  <Many><Stem> 

/wordtext  =  "tobacco" 

*  <Many><Stem> 

/wordtext  =  "pesticide" 

*  <Many><Phrase> 

*  <Many><Stem> 

/wordtext  =  "infant" 

*  <Many><Stem> 

/wordtext  =  "formula" 
that is, just an ACCRUE of "tobacco", "pesticide" and
"infant formula" (with the modification that the
<Stem> and <Many> operators produce).

This  gave  the  following  results: 

Relevant  =  323 

Rel_ret = 107

R-Precision  =  0.2660 
which  puts  the  score  slightly  above  median.  We  expect 
that  most  TREC-2  participant  sites  probably  did  just 
this,  and  those  that  did  much  better  than  median  found 
some  other  specific  examples  of  a  conflicting  policy  and 
modeled  these  in  their  routing  queries. 
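
To make the shape of this experimental query concrete, the following sketch scores a text against the three pieces of evidence. The exact formulas behind Topic's ACCRUE, <Stem>, <Many> and <Phrase> operators are not given here, so the sketch substitutes stand-ins and labels them as such: evidence is combined with a probabilistic sum (each additional match raises the score), stemming is approximated by prefix matching, and every match contributes a fixed, arbitrary score of 0.5.

```python
import re


def tokens(text):
    """Lower-case alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def has_word(toks, word):
    """Crude stand-in for <Stem>: prefix match on each token."""
    return any(t.startswith(word) for t in toks)


def has_phrase(toks, words):
    """Stand-in for <Phrase>: the words must occur adjacently, in order."""
    n = len(words)
    return any(
        all(toks[i + j].startswith(words[j]) for j in range(n))
        for i in range(len(toks) - n + 1)
    )


def accrue(scores):
    """Assumed accrue semantics: probabilistic sum of the evidence."""
    remaining = 1.0
    for s in scores:
        remaining *= 1.0 - s
    return 1.0 - remaining


def query_score(text, per_match=0.5):
    toks = tokens(text)
    matches = [
        has_word(toks, "tobacco"),
        has_word(toks, "pesticide"),
        has_phrase(toks, ["infant", "formula"]),
    ]
    return accrue([per_match for m in matches if m])


print(query_score("Tobacco subsidies continue."))
print(query_score("Exports of pesticides, tobacco and infant formula."))
```

A document matching all three examples outranks one matching a single example, which is the behaviour an accrue-style operator is meant to provide.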

Figure 2

TOPIC2 Relevant vs. Median - Routing Topics

3.4.1.3 TOPIC 67

Our analysis located examples of weak topic formulation,
such as query 67, illustrated in Figure 4. In this query, a
set of optional, auxiliary evidence was "ANDed" with a
small set of required evidence. The weight, or strength,
assigned to the auxiliary evidence was .05, which means
that even if all auxiliary terms were located, the highest
possible score for a document would be .05. This severely
limited the range of scores and thus allowed random
false hits into the top 1000.

To  make  a  cosmetic  improvement,  only  the  value  of  the 
auxiliary  evidence  node  was  changed,  to  a  value  of  .5,  as 
shown  in  Figure  5.  This  change  alone  brought  the  Topic 
relevant  document  count  to  the  median. 


3.4.2   AD  HOC  TOPICS 

Overall,  Verity's  performance  on  the  ad  hoc  topics  was 
adequate.  Performance  was  poorer  than  on  the  routing 
topics,  but  this  is  to  be  expected  since  there  was  less 
time  available  to  build  the  Topics  and  no  ground  truth 
against  which  to  test  the  Topic  trees.  The  relevant 
comparison  to  the  median  is  summarized  in  Figure  3. 
We  count  that  13  of  the  50  results  are  at  or  above 
median. In contrast, though, there were only two
outright  failures  here,  topics  124  and  139.  We  did  not 
look  at  topic  139,  but  topic  124  involves  searching  for 
documents  that  discuss  innovative  approaches  to  cancer 
therapy  that  do  not  involve  any  of  the  traditional 
treatments.  This  is  a  very  hard  topic  because  nearly  all 
mentions  of  the  innovative  treatments  are  in  the  context 
of discussion of traditional therapies. The approach
adopted  by  Verity  of  simply  looking  for  documents  that 
talk  about  innovative  treatment  produces  a  large  number 
of  false  hits  (giving  poor  precision),  and  since  there  is  an 
artificial  cut-off  at  1000  documents  in  the  TREC 
experiments,  this  model  also  produces  poor  recall.  We 
do  not  see  an  obvious  solution  to  this. 

We  picked  three  ad  hoc  topics  to  analyze  in  detail. 

3.4.2.1 AD HOC TOPIC 109

A  relevant  document  for  this  topic  simply  needs  to 
mention  one  of  a  list  of  six  companies  given  in  the 
information  need  statement.  A  simple  Topic  that  is  the 
disjunction  (OR)  of  the  company  names  should  be  all 
that  is  needed  here.  However,  the  official  result  is: 

Relevant  =  742 

Rel_ret  =  192 

R-Precision  =  0.2588 
which is well below median. Furthermore, given the
simplicity of the topic, this is surprisingly low recall.


Examination of the official Topic showed that the
acronyms we used for three of the companies (i.e., 3M,
OTC, ISI) were given the same weight as the fully
spelled-out company names. A cursory review of the original
hit  list  showed  that  ISI  was  a  poor  choice  since  it  has 
multiple  interpretations.  Less  important,  but  for  the 
same  reason,  OTC  is  a  poor  choice  in  the  Wall  Street 
Journal  corpus  since  it  can  mean  "over  the  counter",  and 
in  the  DOE  corpus  3M  is  part  of  a  designator  for  a 
particular  particle  accelerator  and  is  also  used  as  an 
abbreviation  for  "three  meters". 

We  modified  the  Topic  by  eliminating  the  ISI  acronym 
and  by  giving  OTC  and  3M  reduced  weights.  This 
produced  the  following: 

Relevant  =  742 

Rel_ret = 480

R-Precision  =  0.5512 
which  would  have  been  the  best  score. 
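
The repair can be sketched as a weighted disjunction. The code below assumes that an OR node returns the largest weight among its matching children, which is one plausible reading and not a statement of how Topic scores a disjunction. The reduced weight of 0.3, the spelled-out company name and the sample sentences are all hypothetical.

```python
import re


def or_score(text, term_weights):
    """Assumed OR semantics: the largest weight among matching terms."""
    lowered = text.lower()
    matched = [
        weight
        for term, weight in term_weights.items()
        if re.search(r"\b" + re.escape(term) + r"\b", lowered)
    ]
    return max(matched, default=0.0)


FULL_NAME = "example holdings corporation"  # hypothetical spelled-out name

original = {FULL_NAME: 1.0, "3m": 1.0, "otc": 1.0, "isi": 1.0}
modified = {FULL_NAME: 1.0, "3m": 0.3, "otc": 0.3}  # ISI dropped, others reduced

true_hit = "Example Holdings Corporation announced a new contract."
false_hit = "The stock trades OTC at a discount."

for name, weights in (("original", original), ("modified", modified)):
    print(name, or_score(true_hit, weights), or_score(false_hit, weights))
```

With equal weights the ambiguous acronym ties with a genuine mention; with reduced weights it still contributes evidence but ranks below documents that name the company in full.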

An interesting note here is that the original and modified
Topics both had perfect precision and recall for the first 100
documents. Our conclusion is that this was indeed an
easy topic; the false hits produced by ISI were what
hurt Topic's score.

3.4.2.2  AD  HOC  TOPIC  121 

A relevant document for this topic had to mention
the  death  of  a  prominent  U.S.  citizen  due  to  an  identified 
form  of  cancer. 

This  is  an  interesting  topic  consisting  of  two  major 
components  -  the  idea  of  a  prominent  citizen,  and  the 
idea  of  a  specific  cancer. 

In  the  official  Topic,  prominence  was  modeled  using  a 
number  of  words  that  indicate  prominence  (e.g., 
"prominent",  "celebrity")  together  with  words  that 
indicate  prominent  roles  (e.g.,  "Nobel  Prize",  "actor", 
"actress").  Cancer  death  was  modeled  by  various 
combinations  of  death  words  (e.g.,  "death",  "died")  and 
cancer  words  (e.g.,  "cancer",  "tumor",  "leukemia").  The 
official  score  was: 

Relevant  =  55 

Rel_ret = 27

R-Precision  =  0.1455 
which,  while  not  good  in  absolute  terms,  was  well 
above  the  median. 

We  observed  two  problems  with  this  definition.  First,  it 
uses  generic  cancer  terms  rather  than  the  specific  cancer 
types  required  by  the  information  need  statement.  So, 
we  made  all  the  cancer  terms  specific  by  using  a  list  of 
common  cancers  (e.g.,  lung  cancer,  breast  cancer, 
stomach cancer, etc.). We made no attempt to make
this list exhaustive.

Figure 3

TOPIC2 Relevant vs. Median - Adhoc Topics

Figure 4

Example of Poorly Specified Search
Routing Query #66

Figure 5

Poorly Specified Search - Cosmetic Repair
Routing Query #66

This produced the following results:

Relevant  =  55 

Rel_ret = 17

R-Precision  =  0.2182 
Thus  we  reduced  the  recall,  but  increased  the  precision. 
Presumably  by  adding  more  specific  cancers  (or  at  least 
the  ones  that  statistically  are  most  common)  we  could 
have  improved  the  recall  here. 

The  second  problem  is  more  severe  though.  It  appears 
impossible  to  build  any  kind  of  model  that  would  allow 
us  to  determine,  with  any  kind  of  confidence,  that  the 
person  who  has  died  is  a  US  citizen.  In  our  revised 
results  list  we  find  many  prominent  persons  who  died  of 
a  named  cancer  but  who  are  not  US  citizens  (e.g.,  the 
Venezuelan  Ambassador). 

In  addition,  the  notion  of  prominence  is  also  hard  to 
capture. Of course, we might argue that anyone whose
obituary  is  on  the  wire  service  is  prominent  by 
definition!  Be  that  as  it  may,  we  observed  a  number  of 
documents  that  we  did  not  retrieve  because  we  had  not 
included  the  specific  prominent  role  indicator  in  our 
Topic. Thus we added the following role words -
"author",  "poet",  "writer",  "artist",  "painter"  -  to  the 
Topic  and  got  the  following  results: 

Relevant  =  55 

Rel_ret = 33

R-Precision = 0.0909
Thus  we  improved  the  recall  but  at  the  expense  of  the 
precision  again.  Notice  that  we  still  have  not  included 
any  business  or  government  roles,  which  presumably 
would  help  retrieve  the  relevant  documents  in  the  WSJ 
corpus. 

Our conclusion is that this is a significant challenge for
Topic, and for all other systems. The citizenship question
often cannot be resolved by reference to the text alone,
and we see no alternative but to accept the false hits.
Prominence is also difficult, but could conceivably be
approached by an extensive list of prominence and role
words. The specific cancer seems tractable since there
are
are  only  a  finite  number  of  cancers  and  just  a  small  set 
of  those  are  common. 

3.4.2.3  AD  HOC  TOPIC  133 

A relevant document for this topic must describe some
design feature of the Hubble Space Telescope, but must
not report on the launch activity itself, the Hubble
Constant, or Edwin Hubble.

The  official  Topic  was  essentially  a  simple  structure  of 
the  form:  Hubble  Space  Telescope  and  not  launch  and 
not  Edwin  Hubble.  This  gave  the  following  results: 

Relevant  =  80 

Rel_ret = 29

R-Precision  =  0.3625 


which  is  surprisingly  poor  given  the  apparent  simplicity 
of  the  topic. 

Analysis  of  the  behavior  of  the  negation  function  in 
Topic  shows  that  it  is  too  restrictive,  and  so  we 
eliminated  the  negated  concepts  leaving  just  the  phrase 
"Hubble  Space  Telescope".  Using  this  as  the  query 
gave: 

Relevant  =  80 
Rel_ret = 78
R-Precision  =  0.6000 
which  would  have  been  above  median  and  close  to  best. 
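
Why the negated form loses relevant documents can be seen with a strict Boolean stand-in. Topic's negation operator is not specified here, so the sketch below simply treats "not" as hard exclusion; the sample sentence is invented.

```python
def strict_query(text):
    """Phrase AND NOT launch AND NOT Edwin Hubble, as hard Boolean filters."""
    lowered = text.lower()
    return (
        "hubble space telescope" in lowered
        and "launch" not in lowered
        and "edwin hubble" not in lowered
    )


def phrase_only(text):
    """The simplified query: just the phrase."""
    return "hubble space telescope" in text.lower()


# Hypothetical relevant document: it describes a design feature but
# also mentions the launch in passing.
doc = "The Hubble Space Telescope's primary mirror was polished before launch."

print("strict query matches:", strict_query(doc))
print("phrase only matches: ", phrase_only(doc))
```

A document about the telescope's design will often mention the launch in passing, so excluding on that word removes exactly the documents the topic is looking for.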

Adding  as  disjuncts  (OR)  the  words  "Hubble"  and  "HST" 
gave: 

Relevant  =  80 

Rel_ret = 79

R-Precision = 0.6000
that is, we retrieved one extra relevant document with no
decrease in precision.

We  conclude  that  although  the  information  need 
statement  is  careful  to  spell  out  the  cases  where  the 
document  will  be  non-relevant,  the  TREC  corpus  has 
few  documents  where  these  conditions  apply,  so  that  a 
simple  query  performs  very  well.  This  is  presumably 
the  approach  most  sites  took. 

4. FINAL OBSERVATIONS FROM TREC-2

The  TREC-2  topic  descriptions,  particularly  the  ad  hoc 
topics,  exceed  the  level  of  domain  knowledge  available 
to  most  users  of  heterogeneous  document  collections. 

Most operational users of Topic (content-based) search are
driven by time pressures to locate and summarize the most
relevant details in the fewest possible documents. The
exhaustive search result analysis implied by examining
hundreds of relevant documents will not be undertaken in
most user environments; our experience is that ten to
thirty documents is the level of search result analysis
performed by a user (unless significant duplication of
material occurs earlier, which would reduce the number
of documents actually analyzed). Ergonomically, high
precision in the first (10, 20, ... 50) documents is more
likely to keep users engaged than high recall at much
larger counts.
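
The distinction drawn here, early precision versus recall at large counts, can be shown with two hypothetical rankings that contain the same relevant documents within the first fifty but place them differently. The cutoffs follow the ones mentioned above; the data are invented.

```python
def precision_at(ranked_ids, relevant_ids, k):
    """Fraction of the first k retrieved documents that are relevant."""
    relevant = set(relevant_ids)
    return sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant) / k


relevant = {f"d{i}" for i in range(10)}
noise = [f"n{i}" for i in range(40)]

# Ranking A puts all ten relevant documents first.
ranking_a = [f"d{i}" for i in range(10)] + noise

# Ranking B spreads them out, one relevant document every five ranks.
ranking_b = []
for i in range(10):
    ranking_b.append(f"d{i}")
    ranking_b.extend(noise[4 * i: 4 * i + 4])

for k in (10, 20, 50):
    print(k, precision_at(ranking_a, relevant, k), precision_at(ranking_b, relevant, k))
```

Both rankings look identical at a cutoff of 50, yet a user who reads only the first ten documents sees ten relevant items with ranking A and two with ranking B.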

Although we have yet to perform any analysis of
duplicate information on the TREC-2 results, our belief
is that duplicate data is plentiful in the TREC-2 "relevant
lists", and that the reading of duplicate data by the
human user will cause the result analysis to be
(prematurely) terminated.

We  are  certain  that,  unless  summarization  is  performed, 
the  relevant  search  results  on  most  topics  are  too 
numerous  to  warrant  user  attention.  It  would  seem 
reasonable to examine, at least for selected topics,
whether  the  first  ten/best  ten  documents  address  the 
domain  well  from  a  domain  "precision/recall" 
perspective.  To  the  extent  that  the  domain  is  well  served 
in  a  few  representative  documents,  the  coverage  in  the 
representative  documents  may  be  a  "better"  answer  for 
the  user  than  the  numerical  count  of  the  number  relevant 
in the first 1000. We recommend adding a measurement
of the coverage of the domain as the first ten/thirty/n
result documents are examined.
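
No definition of such a coverage measure is given, so the following is only one possible reading: assume each topic has a list of aspects of the domain, assume each judged document is annotated with the aspects it addresses, and report the fraction of aspects addressed by the first n documents. The aspect labels, annotations and function name are all hypothetical.

```python
def coverage_at(ranked_ids, aspects_by_doc, all_aspects, n):
    """Fraction of the topic's aspects addressed by the first n documents."""
    wanted = set(all_aspects)
    if not wanted:
        return 0.0
    covered = set()
    for doc_id in ranked_ids[:n]:
        covered |= aspects_by_doc.get(doc_id, set())
    return len(covered & wanted) / len(wanted)


# Hypothetical annotations: d2 duplicates the content of d1.
all_aspects = {"a", "b", "c", "d"}
aspects_by_doc = {
    "d1": {"a", "b"},
    "d2": {"a", "b"},
    "d3": {"c"},
    "d4": {"d"},
}
ranking = ["d1", "d2", "d3", "d4"]

for n in (1, 2, 3, 4):
    print(n, coverage_at(ranking, aspects_by_doc, all_aspects, n))
```

Under this reading a duplicate document adds to the relevant count but not to coverage, which matches the concern about duplicate data raised above.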

Appendix A

COMPANY  AND  PRODUCT  SUMMARY 

Topic is a commercial off-the-shelf software product line
available from Verity, Inc. Topic search technology is a
commercial adaptation of ideas extracted from the
research of Tong, McCune et al., in Rule-Based
Information  Retrieval,  which  was  sponsored  by  the  U.S. 
Intelligence  Community.  Topic  supports  cataloguing, 
indexing  and  retrospective  search  of  fixed  collections, 
automatic  search  of  newly  indexed  documents  according 
to  (user)  predefined  search  rules  (profiles),  and 
dissemination/notification  based  upon  satisfied  search 
rules.  Documents  may  be  batched  for  indexing/profiling, 
or  processed  automatically  as  they  arrive. 

The  Verity,  Inc.  market  presence  in  content-based  text 
search/retrieval  is  described  in  the  Delphi,  Inc.  1992 
Industry  Summary.  The  Verity  Topic  product  line  is 
considered  to  have  in  excess  of  a  ten  percent  share  of  the 
market  in  commercial-off-the-shelf  content-based 
search/retrieval  products  for  personal  computer  to 
minicomputer  environments. 

Verity  was  founded  in  April  1988.  The  Topic  product 
was  first  licensed  and  installed  by  the  U.S.  Air  Force  in 
June 1987. Verity currently has over 650 installations
and  some  30,000  users.  Many  thousands  of  persons  have 
received  training  from  Verity  on  the  Topic  products. 
Approximately  one-third  of  Verity's  installed  base  uses 
an  event-driven  or  batch  automatic-search-notification 
function. 

Many  organizations  use  the  routing  mechanism  for  users 
who  are  unable  to  compose  the  (appropriate)  queries,  but 
require  the  expert's  result  quality. 

The  Topic  product  line  supports  nearly  twenty  varieties 
of  the  UNIX  operating  environment,  VMS,  OS2,  DOS 
and  Macintosh.  The  product  operates  on  data  stored  in 
the  filesystem  or  in  any  SQL-based  data  base 
management  system.  The  product  as  shipped  supports 
over  twenty  formats  of  native  data  (markup  languages), 
and  provides  the  ability  to  insert  local/third  party  markup 
language  interpreters  as  required.  A  document  in  Topic  is 
logical,  and  may  be  a  file,  subfile  or  any  logical 
decomposition  of  a  physical  native  document. 

The  Topic  end  user  (search)  product  is  available  in 
MSWindows, Presentation Manager, X-Windows-Motif,
Macintosh,  and  character  (keyboard/terminal)  interface 
styles.  There  is  a  4GL-like  command  interpreter 
language  for  rapid  application  development  and  remote 
command  line  interactive  index/search.  There  is  an 
Application Program Interface (C library) to all Topic
functions  for  embedded  applications. 

Abstract

Experiments performed on small collections suggest
that expanding query vectors with words that are
lexically related to the original query words can improve
retrieval effectiveness. Prior experiments using
WordNet to automatically expand vectors in the large
TREC-1 collection were inconclusive regarding effectiveness
gains from lexically related words since any
such effects were dominated by the choice of words to
expand. This paper specifically investigates the effect of
expansion by selecting query concepts to be expanded
by hand. Concepts are represented by WordNet synonym
sets and are expanded by following the typed
links included in WordNet. Experimental results suggest
that this query expansion technique makes little
difference in retrieval effectiveness within the TREC
environment, presumably because the TREC topic statements
provide such a rich description of the information
being sought.

1  Introduction 

The IR group at Siemens Corporate Research is investigating
how concept spaces (data structures that
define semantic relationships among ideas) can be
used to improve retrieval effectiveness in systems designed
to satisfy large-scale information needs. As part
of this research, we expanded document and query vectors
automatically using selected synonyms of original
text words for TREC-1 [5]. The retrieval results
indicated that this expansion technique improved the
performance of some queries, but degraded the performance
of other queries. We concluded that improving
the consistency of the method would require both a better
method for determining the important concepts of
a text and a better method for determining the correct
sense of an ambiguous word.

We took TREC-2 as an opportunity to investigate
the effectiveness of vector expansion when good concepts
are chosen to be expanded. As in TREC-1, query
vectors were expanded using WordNet synonym sets.
However, the synonym sets associated with each query
were selected manually (by the author). These results
therefore represent an upper bound on the effectiveness
to be expected from a completely automatic expansion
process.

The results of the TREC-2 evaluation indicate that
the query expansion procedure used does not significantly
affect retrieval performance even when important
concepts are identified by hand. Some expanded
queries are more effective than their unexpanded counterparts,
but for other queries the unexpanded version
is more effective. In either case, the effectiveness difference
between the two versions is seldom large. Further
testing suggests that more extreme expansion procedures
can cause larger differences in retrieval performance,
but the net effect over a set of queries is degraded
performance compared to no expansion at all.

The remainder of the paper discusses the experiments
in detail. The next section describes the retrieval environment,
including a description of WordNet. Section 3
provides evaluation results for both the official TREC-2
runs and some additional supporting runs. The final
section explores the issue of why the expansion fails to
improve retrieval performance.


2    The  Retrieval  Environment 

The expansion procedure used in this work relies
heavily on the information recorded in WordNet,
a manually-constructed lexical system developed by
George Miller and his colleagues at the Cognitive Science
Laboratory at Princeton University [4]. WordNet's
basic object is a set of strict synonyms, called
a synset. Synsets are organized by the lexical relations
defined on them, which differ depending on part
of speech. For nouns, the only part of WordNet used in
this study, the lexical relations include antonymy,
hypernymy/hyponymy (is-a relation) and three different
meronym/holonym (part-of) relations. The is-a relation
is the dominant relationship, and organizes the
synsets into a set of approximately ten hierarchies¹.
Figure 1 shows a piece of WordNet. The figure contains
all the ancestors and descendants as defined by
the is-a relation for the six senses of the noun swing.
Also shown is that one of the senses, a child's toy, is
part-of a playground.

Given a synset, there is a wide choice of words to
add to a query vector: one can add only the synonyms
within the synset, or all descendants in the is-a
hierarchy, or all words in synsets one link away from
the original synset regardless of link type, etc. One of
the goals of this work is to discover which such strategies
are effective. Wang et al. found expanding vectors
from relational thesauri to be effective [6], but based
those conclusions on experiments performed on one
small collection. Experiments we performed as part of
our TREC-1 work showed serious degradation
when anything other than synonyms was used in the
expansion, but the TREC-1 results were dominated
by the problem of finding good synsets to expand. This
work examines the effectiveness of the different relation
types assuming good synsets are used as the basis.

Siemens' official TREC-2 runs consist of one routing
run (topics 51-100 against the documents on disk
three) and two ad hoc runs (topics 101-150 against the
documents on the first two disks). All of the runs are
manual since the input text of the topics was modified
by hand. There are two types of modifications: parts
of the topic statement that explicitly list things that
are not relevant were removed, and synsets containing
nouns germane to the topic statement were added as a
new section of the topic text. Document text was indexed
completely automatically (once the errors were
fixed²) using the standard SMART indexing routines [1]
(i.e., tokenization, stop word removal, and stemming).
In general, only the "text" fields of the documents were
indexed. For example, only the title, abstract, detailed
claims, claims, and design claims sections were indexed
for the patent subcollection. The manually assigned
keywords included in some of the Ziff documents were
not used, nor were the photograph captions of the San
Jose Mercury collection.

The goal in selecting synsets to be included in a topic
statement was to pick synsets that emphasized important
concepts of the topic. One aspect of the problem
is sense resolution: selecting the synset that contains
the correct sense of an ambiguous original topic
word. However, since one purpose of the experiments
is to investigate how effective lexical relations are in
expanding queries assuming good starting concepts, I
did not restrict myself to adding only synsets that contain
some original topic word. For example, topic 93
asks for information about the support of the NRA
and never mentions the word gun. Nevertheless, I believed
gun to be an important concept of the topic and
added the synset containing gun meaning "a weapon
that discharges a missile from a metal tube or barrel"
to the topic. (Rifle, a word that does appear in the
topic statement, is a grandchild of this synset in WordNet,
with the intervening synset being {firearm, piece,
small-arm}.) Synset selection was also influenced by
the fact that these synsets would be used to expand
the query. Early experiments demonstrated that expansion
worked poorly when synsets with very many
children in the is-a hierarchy (e.g. country) were used,
so those synsets were avoided. Furthermore, when selecting
one sense among the different senses in WordNet
was difficult, I frequently used the words related to the
synsets as a way of making a decision. Figure 2 shows
the original text of topic 93 and the synsets that were
added to it.

¹The actual structure is not quite a hierarchy since a few
synsets have more than one parent.

²There were seven errors total in files patn_014 and patnJOSl
that were not on the official list, but caused the files to not
conform to the patent collection's readme file. These errors
(missing '/TEXT' tags, 'TEXT' tags preceding 'OREF' tags, and the
like) were also fixed manually.

Some topics contained important concepts that had
no corresponding synset. Occasionally, the missing
synset was a gap in WordNet; for example, toxic waste,
genetic engineering, and sanctions meaning economic
disciplinary measures are not in version 1.3 of WordNet.
More often, the important concept was a proper noun
or highly technical term that one wouldn't expect to be
in WordNet. NRA or National Rifle Association, for
example, is an important concept for topic 93 but does
not occur in WordNet. Nothing was added to the topic
texts for concepts that lacked corresponding synsets in
these experiments, although making some provision for
them would improve retrieval performance.

Once the text of the topics is annotated with synsets,
the remainder of the processing is automatic. Selected
fields of the topic statements (the title, nationality,
narrative, factors, description, and concept fields) are
indexed using the standard SMART routines. The terms
derived from these sections are "original query terms".
The expansion procedure is invoked when the synonym
set section is reached. The procedure is controlled by
a set of parameters that specifies for each relation type
included in WordNet the maximum length of a chain
of that type of link that may be followed. A chain
begins at each synset listed in the synset section of
the topic text and may contain only links of a single
type. All synonyms contained within a synset of
the chain are added to the query. Collocations such
as change_of_location in Figure 1 are broken into their
component words, stop words such as of are removed,
and the remaining words are stemmed. The word stems
plus a tag indicating the lexical relation through which
the stems are related to the original synset are then
appended to the original query terms.

[Figure 1 diagram not reproduced: a graph of WordNet synsets
joined by IS-A links and PART links.]

Figure 1: Relations defined for the six senses of the noun swing in WordNet.

<title> Topic: What Backing Does the National Rifle Association Have?
<desc> Description:

Document must describe or identify supporters of the National Rifle
Association (NRA), or its assets.

<narr> Narrative:

To be relevant, a document must describe or name individuals or organizations who are
members of the NRA, or who contribute money to it. A document is also relevant
if it quantifies the NRA's financial assets or identifies any other NRA holdings.

<con> Concept(s):

1. National Rifle Association, NRA

2. contributor, member, supporter

3. holdings, assets, finances

<syn>

{funds, finance, monetary_resource, cash_in_hand, pecuniary_resource}

{supporter, protagonist, champion, admirer, booster}

{gun}

Figure 2: Topic 093 and the synonym sets selected for it.

As an example of the expansion process, consider the
synsets for swing shown in Figure 1. If the synset added
to the topic is the synset containing golf_stroke, and
any number of hyponym (child) links may be traversed,
then the stems of golf, stroke, swing, shot, slice, hook,
drive, putt, approach, chip, and pitch would be added
to the query vector. If hyponym chains are limited to
length one, then chip and pitch would not be added.
If the synset added to the topic is the one containing
swing meaning plaything and any link type may be followed
for one link, then the stems of swing, mechanical,
device, plaything, toy, playground, and trapeze would be
added to the query.
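
The golf_stroke example can be reproduced with a small sketch of the chain-following procedure. The synset table below is hand-built for illustration: the member lists and the placement of chip and pitch beneath approach are assumptions made to agree with the example, not data read from WordNet. Stemming is omitted because the toy words are already in base form, and all function and variable names are hypothetical.

```python
STOP_WORDS = {"of"}

# Hand-built fragment; synset id -> member words (assumed for illustration).
SYNSETS = {
    "golf_stroke": ["golf_stroke", "swing", "shot", "golf_shot"],
    "slice": ["slice"],
    "drive": ["drive"],
    "hook": ["hook"],
    "putt": ["putt"],
    "approach": ["approach", "approach_shot"],
    "chip": ["chip", "chip_shot"],
    "pitch": ["pitch", "pitch_shot"],
}

# (synset id, relation) -> neighbouring synset ids.
LINKS = {
    ("golf_stroke", "hyponym"): ["slice", "drive", "hook", "putt", "approach"],
    ("approach", "hyponym"): ["chip", "pitch"],
}


def words_of(synset_id):
    """Break collocations into component words and drop stop words."""
    words = set()
    for member in SYNSETS[synset_id]:
        for word in member.split("_"):
            if word not in STOP_WORDS:
                words.add(word)
    return words


def expand(start, max_chain_length):
    """Follow chains of a single link type, up to a per-relation limit.

    max_chain_length maps relation -> maximum chain length (None = unlimited).
    Returns relation tag -> set of words reached through that relation.
    """
    expansion = {"synonym": words_of(start)}
    for relation, limit in max_chain_length.items():
        added, seen, frontier, depth = set(), {start}, [start], 0
        while frontier and (limit is None or depth < limit):
            next_frontier = []
            for synset_id in frontier:
                for neighbor in LINKS.get((synset_id, relation), []):
                    if neighbor not in seen:
                        seen.add(neighbor)
                        added |= words_of(neighbor)
                        next_frontier.append(neighbor)
            frontier = next_frontier
            depth += 1
        expansion[relation] = added
    return expansion


def all_words(expansion):
    return set().union(*expansion.values())


unlimited = all_words(expand("golf_stroke", {"hyponym": None}))
one_link = all_words(expand("golf_stroke", {"hyponym": 1}))
print(sorted(unlimited))
print(sorted(unlimited - one_link))
```

Keeping the words grouped by relation tag, rather than in one flat set, is what allows each relation to be weighted separately later.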

Stems added through different lexical relations are
kept separate using the extended vector space model
introduced by Fox [3]. Each query vector is comprised
of subvectors of different concept types (called ctypes)
where each ctype corresponds to a different lexical relation.
A query vector potentially has eleven ctypes:
one for original query terms, one for synonyms, and
one each for the other relation types contained within
the noun portion of WordNet (each half of a symmetric
relation has its own ctype). An original query term
that is a member of a synset selected for that query
appears in both of the respective ctypes. Similarly, a
word that is related to a synset through two different
relations appears in both ctypes.


The  similarity  between  a  document  vector  D  and  an 
extended  query  vector  Q  is  computed  as  the  weighted 
sum  of  the  similarities  between  D  and  each  of  the 
query's  subvectors: 


sim{D,Q)=  ai{DQi) 
ctype  i 


where • denotes the inner product of two vectors, $Q_i$ is the $i$th subvector of $Q$, and $\alpha_i$, a real number, reflects the importance of ctype $i$ relative to the other ctypes. Terms in document vectors are weighted using the lnc weights suggested by Buckley et al. [2]; that is, the weight of a term is set to 1.0 + ln(tf), where tf is the number of times the term occurs in the document, and is then normalized by the square root of the sum of the squares of the weights in the vector (cosine normalization). Query terms are weighted using ltN: the
log  term  frequency  factor  above  is  multiplied  by  the 
term's  inverse  document  frequency,  and  the  weights  in 
the  ctype  representing  original  query  terms  are  normal- 
ized by  the  cosine  factor.  Weights  in  additional  ctypes 
are  normalized  using  the  length  computed  for  the  orig- 
inal terms'  ctype.  This  normalization  strategy  allows 
the  original  query  term  weights  to  be  unaffected  by  the 
expansion  process  and  keeps  the  weights  in  each  ctype 
comparable  with  one  another. 
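The weighting and similarity computation described above can be sketched as follows. This is a minimal illustration, not the SMART implementation used for the runs: the function names, the dictionary-based vectors, and the `idf` table passed in by the caller are all assumptions made for the example.

```python
import math
from collections import Counter

def lnc_weights(tokens):
    """Document weights: 1 + ln(tf), then cosine normalization."""
    raw = {t: 1.0 + math.log(tf) for t, tf in Counter(tokens).items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()} if norm else {}

def query_subvectors(ctype_terms, idf):
    """Query weights: (1 + ln(tf)) * idf per ctype. Every ctype is divided
    by the length of the 'orig' ctype, so original weights do not change
    when expansion ctypes are added."""
    raw = {c: {t: (1.0 + math.log(tf)) * idf.get(t, 0.0)
               for t, tf in Counter(terms).items()}
           for c, terms in ctype_terms.items()}
    length = math.sqrt(sum(w * w for w in raw["orig"].values())) or 1.0
    return {c: {t: w / length for t, w in vec.items()}
            for c, vec in raw.items()}

def extended_sim(doc, subvectors, alpha):
    """sim(D, Q) = sum over ctypes i of alpha_i * (D . Q_i)."""
    return sum(alpha.get(c, 0.0) *
               sum(doc.get(t, 0.0) * w for t, w in vec.items())
               for c, vec in subvectors.items())
```

Setting every added ctype's weight to zero in `alpha` reproduces the unexpanded query, which is how a baseline run can share the same query vectors as an expanded run.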
3  Experiments

The  training  data  for  the  routing  queries  was  used  both 
to  refine  the  synsets  that  were  included  in  the  topic  text 
and  to  select  the  type  of  relations  used  to  expand  the 
query  vectors.  In  some  cases  a  synset  that  appears 
to  be  a  logical  choice  for  a  query  is  nonetheless  detri- 
mental. For  example,  adding  the  synset  for  death  to 
topic  59  (weather  fatalities)  causes  the  query  to  re- 
trieve far  too  many  articles  reporting  on  deaths  that 
have no relation to the weather. I produced five different versions of synset-annotated topic texts, although
the  differences  between  versions  are  not  very  large.  The 
version  used  in  the  official  routing  run  added  an  average 
of  2.9  synsets  to  a  topic  statement,  with  a  minimum  of 
0  synsets  added  and  a  maximum  of  6  synsets  added. 

Of course, the utility of a synset depends in part on how that synset is expanded and the relative weights given to the different link types (the α's in the similarity function above). Table 1 lists the various combinations that were evaluated using the training data. Four different expansion strategies were tried: expansion by synonyms only, expansion by synonyms plus all descendents in the is-a hierarchy, expansion by synonyms plus parents and all descendents in the is-a hierarchy, and expansion by synonyms plus any synset directly related to the given synset (i.e., a chain of length 1 for all link types). Different α values were also investigated. Assuming original query terms are more important than added terms, the α for the original terms subvector was set to one and the α for other subvectors varied between zero and one as shown in Table 1.

The  most  effective  run  was  the  one  that  expanded  a 
query  synset  by  any  synset  directly  related  to  it  and  had 
α = .5 for all added subvectors. Therefore, this strategy
was  used  to  produce  the  official  routing  queries  from  the 
final  version  of  the  annotated  text.  The  scheme  added 
an  average  of  24.7  words  to  a  query  vector  (minimum 
0,  maximum  70),  and  an  average  of  20.2  (0,  66)  words 
that  are  not  part  of  the  original  text. 

The  average  number  of  relevant  documents  retrieved 
at  rank  100  for  this  run  is  40.7  and  at  rank  1000  is 
133.3;  the  mean  "average  precision"  is  .2984.  In  gen- 
eral, the  individual  query  results  are  at  or  slightly  above 
the  median,  with  a  few  queries  significantly  below  the 
median.  Of  more  interest  to  this  study  is  how  the  ex- 
panded queries  compare  to  unexpanded  queries.  A  plot 
of  average  recall  versus  average  precision  for  these  two 
runs  is  given  in  Figure  3.  As  can  be  seen,  the  effective- 
ness of  the  two  query  sets  is  very  similar. 

Since there was no way to evaluate the relative effectiveness of different expansion schemes for the ad hoc queries, the same expansion scheme as was used for the official routing run (chains of length one for any relation type and all α's = .5) was used for


the ad hoc run. Furthermore, there could be no refining of which synsets to add, so only one version of synset-annotated text was produced. An average of 2.7 (minimum 0, maximum 6) synonyms was added to an
ad  hoc  topic  text.  The  expansion  process  added  an  av- 
erage of  17.2  (0,  66)  terms  and  12.8  (0,  55)  terms  that 
are  not  part  of  the  original  text. 

Siemens actually submitted two ad hoc runs. The first was the expanded queries with α's set to 0, a run that is equivalent to no expansion and is used as a base case. The second Siemens ad hoc run used the .5 α values. A plot of the effectiveness of the two ad hoc runs is given in Figure 4. The difference in effectiveness between unexpanded and expanded queries is even smaller for the ad hoc queries than it is for the routing queries.
The  average  number  of  relevant  documents  retrieved  at 
rank  100  is  46.9  for  both  the  unexpanded  and  expanded 
queries.  The  average  number  of  relevant  documents  re- 
trieved at  rank  1000  is  161.4  for  the  unexpanded  queries 
and  161.3  for  the  expanded  queries.  The  mean  "average 
precision"  is  .3408  and  .3397  respectively. 

A  possible  explanation  for  the  little  difference  made 
by  expanding  the  queries  is  that  the  expansion  param- 
eters used  were  too  conservative.  To  test  this  hypoth- 
esis, additional  runs  were  made  using  the  same  set  of 
synsets  but  allowing  longer  chains  of  links  and/or  using 
greater relative link weights (the α's). Table 2 lists the
additional  combinations  tested  using  both  the  ad  hoc 
queries  versus  the  documents  on  disks  one  and  two, 
and  the  routing  queries  versus  the  documents  on  disk 
3.  As  was  the  case  for  the  routing  training  runs,  the 
strategy  used  for  the  official  TREC-2  runs  (all  links  of 
length one, α's = .5) was the most effective expansion
strategy.  The  more  aggressive  expansion  strategies  did 
make  larger  differences  in  retrieval  effectiveness  com- 
pared to  the  unexpanded  queries,  but  across  the  set  of 
queries  the  aggregate  difference  was  negative.  Hence  it 
is  unlikely  that  the  conservative  expansion  strategy  is 
the  reason  for  the  lack  of  improvement. 

4  Conclusion 

The  experimental  evidence  clearly  shows  this  query  ex- 
pansion technique  provides  little  benefit  in  the  TREC 
environment.  The  most  likely  reason  for  why  this 
should  be  so  is  the  completeness  of  the  TREC  topic  de- 
scriptions. Query  expansion  is  a  recall-enhancing  tech- 
nique and  TREC  topic  descriptions  are  already  large 
compared  to  queries  found  in  traditional  IR  collections. 
Although  most  of  the  expanded  queries  did  have  some 
new  terms  added  to  them,  the  most  important  terms 
frequently  appeared  in  both  the  original  term  set  and 
the  set  of  expanded  terms.  This  had  an  effect  on  the 
relative  weight  of  those  terms  in  the  overall  similarity 
Expansion by synonyms only

  orig terms α   synonyms α
       1             .1
       1             .3
       1             .5
       1             .8

Expansion by synonyms plus all descendents

  orig terms α   synonyms α   descendents α
       1             .1            .1
       1             .3            .1
       1             .3            .3
       1             .5            .1
       1             .5            .3
       1             .5            .5
       1             .8            .1
       1             .8            .3
       1             .8            .5

Expansion by synonyms plus parents and all descendents

  orig terms α   synonyms α   descendents α   parents α
       1             .1            .1            .1
       1             .3            .1            .1
       1             .3            .3            .3
       1             .5            .1            .1
       1             .5            .3            .1
       1             .5            .3            .3
       1             .5            .5            .1
       1             .5            .5            .3
       1             .5            .5            .5
       1             .8            .1            .1
       1             .8            .3            .1
       1             .8            .3            .3
       1             .8            .5            .1
       1             .8            .5            .3
       1             .8            .5            .5

Expansion by synonyms plus any directly related synset

  orig terms α   synonyms α   other α
       1             .3          .1
       1             .3          .3
       1             .5          .1
       1             .5          .3
       1             .5          .5
       1             .3          .5

Table  1:  Combinations  of  expansion  strategies  and  relation  weights  tested.
[Recall-precision plot; x-axis: Recall; curves: unexpanded, expanded.]

Figure  4:  Effectiveness  of  expanded  versus  unexpanded  ad  hoc  queries.
Expansion by synonyms plus parents and all descendents

  orig terms α   synonyms α   descendents α   parents α
       1             .5            .5            .5
       1             1             .5            .5
       1             1             1             1
       1             2             1             1

Expansion by synonyms plus any directly related synset

  orig terms α   synonyms α   other α
       1             0           0
       1             .5          .5
       1             1           1

Table  2:  Additional  combinations  of  expansion  strategies  and  relation  weights  tested.


computed  for  a  document,  especially  when  some  im- 
portant terms  had  no  corresponding  synset.  Topic  112 
provides an example of this effect. The topic is concerned with the world-wide investment in biotechnology. I added synonym sets for investment and capital to the topic. WordNet does not contain biotechnology, although it does contain biomedical_science. Thus, I also added the biomedical_science synset and a synset
containing  gene.  The  resulting  expanded  query  per- 
formed significantly  worse  than  the  unexpanded  version 
(33  relevant  retrieved  in  the  first  100  versus  52  relevant 
retrieved).  The  problem  is  that  the  expanded  query 
places  too  much  emphasis  on  money  and  not  enough 
on  biotechnology.  Thus  these  results  indicate  that  sim- 
ply recognizing  which  are  the  important  concepts  in  a 
query  statement  is  not  sufficient  to  ensure  improved  re- 
trieval performance.  An  expansion  procedure  must  also 
preserve  the  relative  weights  of  those  concepts. 

Another  possible  explanation  is  that  WordNet  is  not 
suited  for  this  task  —  it  was  not  designed  to  be  used  in 
this  manner  and  it  may  not  contain  the  necessary  links. 
Even  if  this  is  true,  however,  it  is  unlikely  that  any  other 
broad-coverage  knowledge  base  would  be  better  suited. 
The  success  of  relevance  feedback  and  other  routing 
techniques  suggests  that  the  most  useful  relations  are 
specific  and  idiosyncratic. 

A  second  goal  of  this  work  was  to  characterize  the 
effectiveness  of  different  types  of  lexical  relations  when 
used  to  expand  a  query.  Assuming  the  set  of  words  to 
be  expanded  is  well  chosen,  any  closely  related  word  — 
regardless  of  the  type  of  relation  —  may  be  a  good 
additional  word.  Wang  et  al.  reached  a  similar  con- 
clusion [6].  Nevertheless,  an  added  word  should  be 
weighted  less  than  the  original  word  that  caused  it 
to  be  included.  Runs  in  which  added  words  were 
equally  or  more  heavily  weighted  than  original  words 
were  consistently  less  effective  than  the  more  conserva- 
tively weighted  runs.  Similarly,  runs  that  added  words 


that  were  loosely  related  to  original  words  (i.e.,  when 
long  paths  of  links  were  followed)  were  consistently  less 
effective  than  runs  that  used  only  near  relatives. 
Abstract

We performed the full experiments using our network implementation of the component probabilistic indexing and retrieval model. Documents were enhanced with a list of
semi-automatically  generated  two-word  phrases,  and  queries 
with  automatic  Boolean  expressions.  An  item  self-learning 
procedure  was  used  to  initiate  network  edge  weights  for 
retrieval.  Initial  results  submitted  were  above  median  for  ad 
hoc,  and  below  median  for  routing.  They  were  not  up  to 
expectation  because  of  a  bad  choice  of  high-frequency  cut- 
off for  terms,  and  no  query  expansion  for  routing.  Later 
experiments  showed  that  our  system  does  return  very  good 
results  after  correcting  the  earlier  problems  and  adjusting 
some parameters. We also re-designed our system to handle
virtually  any  number  of  large  files  in  an  incremental 
fashion,  and  to  do  retrieval  and  learning  by  initiating  our 
network  on  demand,  without  first  creating  a  full  inverted 
file. 


1.  Introduction 

In TREC1 our system, called PIRCS (an acronym for Probabilistic Indexing and Retrieval -Components- System), took part as a Category B participant, handling only the 0.5 GB Wall Street Journal collection because neither our software nor our hardware was sufficient for the full set of text files. In TREC2, we participated in Category A. However, during a large portion of the time period we had to face fairly uncertain and sometimes difficult conditions. Plans to install a dedicated SPARC 10 workstation and associated large memory and disk drives did not materialize until about three weeks from the deadline. Before this period, the SPARC2 workstation that we had been using was also shared with other users during the semester. Certain things that we wished to do were not done, and corners were cut to fit programs and data in the existing system. Much of our time was spent revamping our software to be more efficient in space and time utilization. Our focus remains as in TREC1, namely, to improve
representations  of  documents  and  queries,  to  test  different 


learning  methods  and  to  combine  different  retrieval  methods 
to  improve  final  ranked  retrieval  output.  Section  2 
summarizes  our  retrieval  network;  Section  3  discusses  our 
improved  system  design;  Section  4  is  on  item 
representation; Sections 5 and 6 are about our learning and retrieval procedures; Section 7 discusses the results we
submitted  and  Section  8  contains  results  of  our  later 
experiments.  Section  9  follows  with  the  conclusion. 

2.  A  Retrieval  Network  in  PIRCS 

Our  retrieval  process  is  based  on  a  three  layer  Q-T-D 
(Query-Term-Document)  network,  details  of  which  are 
given  in  [KwPK93,Kwok9x].  Here  we  give  a  review. 
From Fig.1, DTQ query-focused retrieval means spreading an initial activation of 1 (one) from a document $d_i$ towards query $q_a$, gated by intervening edges $w_{ki}$ and $w_{ak}$. The resultant activation received at $q_a$ is $W_{i/q} = \sum_k w_{ki} w_{ak}$, and is the retrieval status value (RSV) of $d_i$ with respect to $q_a$. When activation initiated at $q_a$ spreads towards $d_i$, the activation received at $d_i$ equals $W_{i/d} = \sum_k w_{ik} w_{ka}$, and is our QTD document-focused RSV for $d_i$. Combining the two additively, $W_i = W_{i/q} + W_{i/d}$ gives our basic retrieval ranking function. Edge weights $w_{ka}$, $w_{ki}$ represent items ($q_a$ or $d_i$) acting on terms $t_k$ and reflect usage of terms within items. Edge weights $w_{ak}$, $w_{ik}$ (representing $t_k$ acting on $q_a$ or $d_i$) embed Bayesian inference and are initialized based on
a  component  consideration  of  probabilistic  indexing  and 
retrieval.  These  weights  can  improve  via  a  learning  process 
when  relevant  documents  are  known  to  queries  and  vice 
versa:  DTQ  query-focused  training  when  we  know  the  set 
of  documents  and  their  components  relevant  to  a  query,  and 
QTD  document-focused  training  when  we  know  the  set  of 
queries  and  their  components  relevant  to  a  document. 
Query-focused  training  prepares  queries  to  match  new 
similar  documents  better,  while  document-focused  training 
helps  documents  to  match  new  similar  queries  better.  With 
learning  capability,  the  net  behaves  and  can  be  viewed  as  a 
superposition  of  two  2-layer  direct-connect  artificial  neural 
networks,  one  in  each  direction.  If  a  boolean  expression  for 
query $q_a$ is known, it can also be represented as a tree and
hung  onto  the  net  as  shown  in  Fig.2.  Edge  weights  from 
the net are used to initialize the tree leaf nodes, from which activation spreads to the query root node. Processing at the nodes implements soft-Boolean evaluation [SaFW83] in the DTQ direction, resulting in a third RSV: $S_i$. The RSVs are later combined for ranking retrieval outputs.
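The additive combination of the two spreading directions can be illustrated with a small sketch. It is only an assumption-laden toy: each item is represented here by two hypothetical dictionaries, `usage` (the item acting on a term) and `inference` (the term acting on the item), and the numeric weights are invented for the example rather than learned.

```python
def combined_rsv(doc, query):
    """Return W_i = W_i/q + W_i/d over the terms shared by doc and query."""
    shared = doc["usage"].keys() & query["usage"].keys()
    # DTQ: activation flows doc -> term -> query.
    w_iq = sum(doc["usage"][k] * query["inference"][k] for k in shared)
    # QTD: activation flows query -> term -> doc.
    w_id = sum(query["usage"][k] * doc["inference"][k] for k in shared)
    return w_iq + w_id


doc = {"usage": {"oil": 0.5, "spill": 0.5},
       "inference": {"oil": 2.0, "spill": 3.0}}
query = {"usage": {"oil": 1.0}, "inference": {"oil": 4.0}}
print(combined_rsv(doc, query))
```

Only terms present in both items contribute, since activation cannot pass through a term node that lacks an edge on either side.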


3.  System  Design 

Our previous software for TREC1 has extraneous processing that produces several intermediate files for other purposes. These consume a lot of disk space for large scale collections. We spent a substantial amount of time to revamp our system, resulting in a more streamlined flow-chart as follows:

    PreProcess  -->  Create        -->  Initiate  -->  Network   -->  Retrieve  -->  Evaluate
    Documents        Direct files       Network        Learning       and Rank
                                           ^
    PreProcess  -->  ----------------------+
    Queries                                |
                                           |
    Relevance File  -->  ------------------+

In addition, we also took the opportunity to re-design our system, resulting in several innovative approaches described in the following subsections:

3.1 Full Inverted File Eliminated

It is useful to view a textbase as a document by term matrix. If one stores the matrix rowwise, we call it a direct file, in contrast to an inverted file which stores the matrix columnwise. The inverted file is useful to support fast retrieval while the direct file is useful for feedback learning and query expansion when given certain documents being relevant to certain queries. In addition, the raw textfile is useful for display purposes after a retrieval ranked list is produced. If one assumes that each of these three files is approximately equal in size at N bytes, we need a minimum of 3N bytes, which is quite substantial. Removing the raw text may not win user support during display unless the direct file encodes all stopwords, punctuation and paragraph structures of the original. Most systems delete the direct file and re-produce a subset of it from raw text when needed. This results in a requirement of 2N bytes. We choose, however, to keep the direct file and produce our network with respect to the queries dynamically as needed, without first producing an inverted file. Our network actually contains both direct and 'inverted' data. It resides in memory to support learning and retrieval. By this approach we achieve several objectives for our system: a) reduce the 'dead' time between a collection being acquired and its availability for searching, since the full inverted file
is  not  produced;  b)  satisfy  the  2N  bytes  of  space 
requirement;  c)  support  fast  feedback  learning  and  query 
expansion  by  having  the  direct  file  available.  The  price  we 
pay  is  to  have  to  create  the  network  dynamically,  unlike  the 
full  inverted  file  which  is  produced  once.  However,  since 
the  full  inverted  file  is  too  large  to  reside  in  memory, 
reading  parts  of  it  in  for  retrieval  is  also  time  consuming. 

3.2  Reduced  Network  Size 

We  produce  our  network  in  memory  according  to  the 
queries  under  attention  and  the  terms  used  in  them.  Since 
memory is limited, documents are divided into subcollections (Section 3.3) and queries are linked in five to ten at a time. We define the active term set (ATS) of a network as the set of terms used by the current queries and their feedback documents if any. The latter will provide terms for query expansion. Only edges that connect items to the active term set are initiated in the network, reducing network space requirements substantially. With the current implementation for a 1 GB subcollection and 7 queries, the node and edge files together take about 40 MB. Using clock time as a measure, producing the network requires about 40 minutes. Learning is fast, about 2 minutes, and retrieval and ranking another 8 minutes. We hope that
further  improvements  in  design  and  faster  hardware  in  the 
future  can  improve  these  figures  substantially. 
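A rough sketch of building a network restricted to the active term set from a direct file is given below. The data layout (a dictionary mapping document ids to term frequencies) and the function names are assumptions for illustration only; the actual PIRCS node and edge files are not described at this level of detail.

```python
from collections import defaultdict

def active_term_set(queries, feedback_docs=()):
    """Union of the terms used by the current queries and feedback documents."""
    ats = set()
    for terms in queries.values():
        ats.update(terms)
    for terms in feedback_docs:
        ats.update(terms)
    return ats

def build_network(direct_file, ats):
    """Keep only edges that touch the active term set.

    Returns the 'direct' view (doc -> {term: tf}) and the 'inverted'
    view (term -> {doc: tf}), both held in memory together.
    """
    doc_edges, term_edges = {}, defaultdict(dict)
    for doc_id, tfs in direct_file.items():
        kept = {t: tf for t, tf in tfs.items() if t in ats}
        if kept:
            doc_edges[doc_id] = kept
            for t, tf in kept.items():
                term_edges[t][doc_id] = tf
    return doc_edges, dict(term_edges)
```

Documents that share no term with the active term set create no nodes at all, which is where most of the space saving in such a scheme would come from.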

3.3 Master and Subcollection File Structure

We  view  the  TREC  experiments  as  document  retrieval  from 
multiple  collections,  but  reporting  retrieval  results  in  one 
single  ranked  list  of  documents  for  each  topic  (query). 
Although  only  four  or  five  collections  such  as  Wall  Street 
Journal,  Associated  Press,  etc.  are  given  in  TREC,  in  reality 
it could be many more. We consider three methods in file
design  to  approach  the  problem: 

a) A centralized approach where all documents from all collections are processed as if from a single document stream, producing a centralized dictionary containing full usage statistics and a giant direct file. From these, networks can be initialized. The idea is simple, but it has drawbacks in that eventually the direct file would exceed file/disk size limitations and software has to be designed to handle data crossing file/disk boundaries. Moreover, it is inherently fragile to create single files of this size. The advantage of this approach is that RSVs calculated are directly comparable for all documents and a single ranked list is produced without difficulty.


b) An independent collections approach where each source collection of, say, about 0.5-1.0 GB forms a textbase with its own local dictionary and direct file, for network initiation and retrieval. One simply repeats the process for as many collections as necessary. This is the preferred approach, and if one has n processors and sufficient disk space, n separate textbases can be created for learning and retrieval in parallel, saving substantial time. The problem is how to combine the retrieval lists from each into a single ranked list, since each textbase has its own term usage statistics and calculates RSVs for ranking within its own environment. Classical Boolean retrieval and coordinate matching pose no problem. Some retrieval strategies may produce RSVs that are comparable across collections in theory; but after approximations are taken, it is questionable that this is still true. A similar problem exists for retrieval from distributed databases such as the WAIS environment.

c) For TREC2 we settle on a hybrid subcollections approach, treating each source as a subcollection within a master. We create a master centralized dictionary as in a), capturing full usage statistics serving all the subcollections, and create separate direct files for each subcollection as in b). The central dictionary has about 620,000 unique terms after processing 2 GB from Disk1 and Disk2, and is relatively small. It captures global term usage statistics, while the individual direct files capture local usage statistics within items. Separate networks are then created for each subcollection with edge weights based on the correct global and local statistics as in a), assuring that retrieval lists contain RSVs that are directly comparable. This approach combines the advantages of both a) and b), and can also function in a parallel distributed environment.
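The reason the hybrid design yields comparable scores can be seen in a small sketch: every subcollection is scored separately, but with term weights taken from one shared master dictionary, so the pooled scores can simply be sorted. The weighting used here (a logarithm of total tokens over collection frequency, times the within-document proportion) is a stand-in chosen for the illustration and is not the PIRCS edge weight; all names are hypothetical.

```python
import math

def global_weights(master_dictionary, total_tokens):
    """One weight per term, from collection-wide frequencies."""
    return {term: math.log(total_tokens / cf)
            for term, cf in master_dictionary.items()}

def score_subcollection(direct_file, query_terms, weights):
    """Score each document using local usage and global weights."""
    scored = []
    for doc_id, tfs in direct_file.items():
        length = sum(tfs.values())
        score = 0.0
        if length:
            score = sum(tfs[t] / length * weights[t]
                        for t in query_terms if t in tfs and t in weights)
        scored.append((score, doc_id))
    return scored

def merged_ranking(subcollections, query_terms, weights):
    """Pool the per-subcollection scores and sort them into one list."""
    pooled = []
    for direct_file in subcollections.values():
        pooled.extend(score_subcollection(direct_file, query_terms, weights))
    return [doc_id for score, doc_id in
            sorted(pooled, key=lambda pair: pair[0], reverse=True)]
```

Because `score_subcollection` touches only one direct file at a time, each call could run on a separate processor, with only the small master dictionary shared.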

4.  Item  Representation 

As in TREC1, a number of preprocessing steps, mainly for the purpose of improving the representations of documents and queries, are done as follows:

4.1  Vocabulary  Control 

In  addition  to  a  manual  stopword  list  of  about  630  words 
and  another  528  manually  identified  2-word  phrases,  we 
also process samples from Disk1 and Disk2 using all five source types (WSJ, AP, FR, ZIFF and DOE, of about 100 MB each) to produce two-word phrases based on adjacency within sentence context. Our objective is to remedy losses of recall and precision due to the removal of high frequency terms. Our criterion for phrases is that each word pair must have a frequency of 40, and either one or both components must be high frequency (>= 10000). A casual scan of the resultant list led us to remove some obvious non-content phrases (such as 'author describ', 'paper attempt', 'two way'). The result is 13,787 pairs. These plus the previously prepared manual list (which consists mostly of phrases with at least one stopword as well as some phrases identified in the 'topics') are then
treated  as  if  they  were  single  index  terms  to  be  identified 
during  document  and  query  processing  [BuSA93].  They 
have  their  own  global  and  local  usage  statistics,  and  can 
improve  individual  collection  retrieval  effectiveness  from  a 
few  to  10%. 
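The phrase selection rule can be sketched as below. This is a simplified reading: the pair-frequency threshold is treated as a minimum, sentences are assumed to arrive as lists of already stemmed words, and the function name and parameter names are invented for the example.

```python
from collections import Counter

def candidate_phrases(sentences, min_pair_freq=40, high_freq=10000):
    """Adjacent word pairs that are frequent enough and contain at
    least one high-frequency component."""
    word_freq, pair_freq = Counter(), Counter()
    for words in sentences:
        word_freq.update(words)
        # zip pairs each word with its right-hand neighbour in the sentence
        pair_freq.update(zip(words, words[1:]))
    return sorted(
        pair for pair, n in pair_freq.items()
        if n >= min_pair_freq
        and (word_freq[pair[0]] >= high_freq or word_freq[pair[1]] >= high_freq)
    )
```

The manual pruning step described above (dropping pairs such as 'author describ') would follow as a separate filter over the returned list.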

After documents are processed, we invoke Zipf's law to remove low (<= 4) and high frequency (>= 16000) words from being used for representation. Low frequency terms lead to too many nodes, and high frequency terms lead to too many edges, both consuming valuable memory space. Unfortunately, our high frequency cut-off of 16000 was a bad choice. In fact, we used high = 12000 for routing because at that earlier period, we were short of resources. The effect is that our queries become too short and many useful terms (such as 'platform' in query #80, 'crimin' for criminal in query #87, etc.) are screened out. We discovered that this is a major factor in our disappointing results. Later experiments use a high cut-off of 50000.


4.2  Subdocuments 

As in TREC1 we segment documents into subdocument units to deal with the problems of WSJ documents having multiple unrelated stories, and long documents in general. A really long document is FR89119-0111, which has 400748 words. Our criterion is to break documents either on story boundaries or on the next paragraph boundary after a run of 360 words for all collections. We have not found convenient story boundary indicators in other collections as in WSJ (which uses three dashes '---' on a line). With this scheme, the total number of subdocuments from Disk 1&2 becomes 1,281,233 compared with an original 742,611.
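The segmentation rule can be sketched as follows, assuming a document is already split into paragraph strings and that a story boundary appears as a paragraph consisting only of three dashes. The function name and the list-of-paragraphs representation are assumptions for the example.

```python
def segment(paragraphs, run_length=360, story_break="---"):
    """Split a document into subdocuments.

    A break is made at a story-boundary line, or at the first paragraph
    boundary reached after run_length words have accumulated.
    """
    subdocs, current, count = [], [], 0
    for para in paragraphs:
        if para.strip() == story_break:
            if current:
                subdocs.append(current)
            current, count = [], 0
            continue
        current.append(para)
        count += len(para.split())
        if count >= run_length:
            subdocs.append(current)
            current, count = [], 0
    if current:
        subdocs.append(current)
    return subdocs
```

Raising `run_length` to 720 or 1080 gives the larger chunk sizes, and passing a `story_break` string that never occurs disables story-boundary splitting.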

After the TREC2 deadline, we had the resources to investigate the effects of subdocuments on retrieval. Experiments were performed on individual subcollections WSJ1, FR2, FR1 and AP2, using segmentation sizes of 360, 720 and 1080 words. For WSJ1, we further break on story boundaries only. Results are tabulated in Table 1. It appears that for the abnormally long documents of FR, breaking into subdocuments is definitely worthwhile, achieving improvements of over 20% compared with no segmentation. However, for the newswire documents of WSJ1 and AP2, subdocuments have marginal performance effects: a little better for large chunk sizes, and a little worse for small chunks. It seems that a chunk size of about 720-1000 words would get the benefits of both types. Using different chunk sizes for different collections would probably not be worth the effort.

Table 1: Document Segmentation Av11 Precision using 50 Queries Q2

We would like to point out that, other than effectiveness,
considerations  such  as  isolating  relevant  sections  for  output 


display  and  for  more  precise  feedback  judgment  would  also 
make  document  segmentation  worthwhile.  In  particular,  a 
number  of  long  documents  in  feedback  for  query  expansion 
would  easily  overload  memory  space  in  our  network. 

4.3  Queries 

Topics  are  preprocessed  to  remove  introductory  phrases  and 
non-content terms based on word sequence patterns. We continue to use words from the title, description, narrative and concept sections of the topics to form queries. Experiments without the narrative section show a slight decrease in performance for our system, in contrast to [Gof93]. We also try to automatically produce Boolean expressions as queries from the description and concepts sections. This is done by using punctuation to delineate phrases and ANDing the words within them. Phrases are
then  ORed.  This  is  a  very  crude  way  of  getting  Boolean 
expressions  for  later  soft-Boolean  retrieval  processing. 
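The crude Boolean construction described above can be sketched in a few lines. The set of punctuation characters, the lowercasing, and the optional stopword filter are assumptions added for the illustration; the source does not specify them.

```python
import re

def boolean_query(text, stopwords=frozenset()):
    """Split on punctuation into phrases; return a list of word lists.

    Words inside one list are ANDed, and the lists are ORed together.
    """
    phrases = []
    for chunk in re.split(r"[,.;:()?!]+", text.lower()):
        words = [w for w in re.findall(r"[a-z0-9]+", chunk)
                 if w not in stopwords]
        if words:
            phrases.append(words)
    return phrases

def to_string(phrases):
    """Render the OR-of-ANDs structure for inspection."""
    return " OR ".join("(" + " AND ".join(p) + ")" for p in phrases)


print(to_string(boolean_query("oil spill, tanker accident; cleanup")))
```

The nested-list form maps directly onto a two-level tree (an OR root with AND children), which is the shape a soft-Boolean evaluator would take as input.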


5.  Learning  Procedures 

In our network of Fig.1 the edge weights determine the retrieval outcome. $w_{ki}$ and $w_{ka}$ capture the proportion of term $t_k$ in $d_i$ or $q_a$. They are fixed and obtained from the manifestation of terms in the respective items. $w_{ak}$ (and $w_{ik}$) has a log odds factor, and an Inverse Collection Term Frequency (ICTF) factor which is regarded as a constant for a term $t_k$, as follows:

$$w_{ak} = \ln\left[\frac{r_{ak}}{1 - r_{ak}}\right] + \mathrm{ICTF}_k \qquad (1)$$

Here $r_{ak}$ is the conditional probability that, given relevance to $q_a$, term $t_k$ occurs, and needs to be learnt. It is unknown unless one has a sample of relevant documents to $q_a$. This is not applicable to initial ad hoc retrievals where a document collection is being processed against a new query with no known results. Relevance feedback information can remedy this later, but it is not available at the beginning. One way is to ignore the log odds factor in Eqn.1, as done in our TREC1 experiments, resulting in ICTF weighting. A better way, which we use for TREC2, is to include item self-learning to determine $r_{ak}$, and initiate


the term weights. This is shown in Fig.3 and is based on the following argument. Consider a document $d_i$ containing certain concepts and topics. Imagine its author wishing to query the textbase for the same topics as the document s/he has written; what query would be most suitable? Naturally the author's own words in the document can serve as the 'query', and there is also known to be one relevant item to this 'query' in the collection, viz. the document itself. In other words, every item is assumed to be self-relevant. One relevant item is however not sufficient for estimating $r_{ak}$. Our method is to consider each document as constituted of many independent conceptual components, each being described by a list of terms. We therefore work in a component universe rather than in the document collection. Components can be units such as sentences or phrases, but we have used single terms for simplicity. Right from the start then, even without any relevance feedback, we can divide the component universe into two parts: one set relevant, and the other non-relevant, to the 'query'. Standard probabilistic retrieval theory now enables us to define this 'query' optimally, meaning that the defined 'query' will rank its set of relevant components optimally with respect to the other components when used for retrieval. The definition of this 'query' (i.e. its terms and weights) becomes the initial indexing representation of the document, and it is used in our ad hoc QTD retrieval. This is the principle of document self-recovery introduced in [Kwok90] and implemented as a self-learning process in a network [Kwok89,9x] shown in Fig.4. One can argue that this relevant set of components from one document is too small. But that is all the information one has at this stage. Previous experiments [Kwok90] show that this kind of weighting can outperform ICTF weighting by a few percent. Moreover, our network self-learning parameters can be adjusted to provide a smooth transition from ICTF to full self-learn weights, or any value in between. We invoke the
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Fig.3    Item  Self-Learn  Using  Its  Own  Self-Relevant  Components 
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same  self-learning  procedure  for  a  query  by  adding  the 
query  to  tiie  collection  as  a  'dociiment'  temporarily, 
resulting  in  self-learn  initial  weights  for  the  indexing 
representation  of  the  query.  This  weight  is  used  in  our  ad 
hoc  DTQ  retrieval.  The  bottom  line  is  that  this  component 
consideration  enables  the  probabilistic  model  to  self- 
bootstrap,  allows  term  frequencies  in  items  to  be  employed 
instead  of  'binary',  and  takes  into  account  of  query -focused 
and  document-focused  retrieval  in  a  cooperative  fashion. 
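The weighting of Eqn.1 can be illustrated with a minimal sketch. It assumes the ICTF factor has the form ln[(1 - s)/s], where s is the share of all term occurrences in the collection that belong to the term; that form, and the names `ictf`, `edge_weight`, `term_count` and `total_count`, are assumptions for illustration and are not defined in this section. How self-learning arrives at the probability r is not reproduced here; the sketch only shows how a given r enters the weight and how omitting it falls back to ICTF weighting.

```python
import math
from typing import Optional


def ictf(term_count: int, total_count: int) -> float:
    """ICTF factor, assumed here to be ln[(1 - s) / s] with
    s = term_count / total_count (share of all term occurrences)."""
    s = term_count / total_count
    return math.log((1.0 - s) / s)


def edge_weight(r: Optional[float], term_count: int, total_count: int) -> float:
    """Eqn. 1: log odds of r plus the ICTF factor.

    Passing r=None drops the log-odds factor, i.e. plain ICTF weighting.
    """
    weight = ictf(term_count, total_count)
    if r is not None:
        weight += math.log(r / (1.0 - r))
    return weight


# A rare term (10 of 1,000,000 occurrences), without and with a learnt r.
print(edge_weight(None, 10, 1_000_000))
print(edge_weight(0.8, 10, 1_000_000))
```

With r = 0.5 the log-odds factor is zero, so the two weightings coincide; values of r above 0.5 raise the weight and values below 0.5 lower it.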

In the case of routing retrieval, relevant documents are available
for training each topic $q_a$. These documents are again considered as
constituted of components, and are used in the network to estimate
$r_{ak}$. The estimate should be better than self-learning because the
sample of relevant components is much larger. Our learning algorithm
updates the conditional probability $r_{ak}$ (and $r_{ik}$) as follows
(Fig.5):

$$\Delta r_{ak} = \eta_Q \, (x_k - r_{ak}^{\mathrm{old}}) \qquad (2)$$

Here, $\eta_Q$ is a learning rate for training on the query side and
$x_k$ is the average activation deposited on term $t_k$ by the given
relevant set. If a term $t_k$ not in the original query is highly
activated by the relevant set, it can also be linked to $q_a$, with
edge weights given by:

$$w_{ka} = \alpha \, x_k$$

$$r_{ak} = \beta \, \eta_Q \, x_k \qquad (3)$$

This implements query expansion as was also done in TREC1.
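The update of Eqn.2 and the selection of expansion terms can be sketched as follows. This is an illustration under assumptions: the function names, the dictionary representation of activations, and the rule "take the `level` most activated terms and drop those already in the query" are hypothetical, chosen to match the description that an expansion level of K may add fewer than K terms.

```python
from typing import Dict, List, Set


def update_r(r_old: float, x: float, eta_q: float) -> float:
    """Eqn. 2: move r toward the average activation x at learning rate eta_q."""
    return r_old + eta_q * (x - r_old)


def expansion_terms(query_terms: Set[str],
                    activations: Dict[str, float],
                    level: int) -> List[str]:
    """Take the `level` most activated terms and keep only those that are
    not already in the query (so fewer than `level` terms may be added)."""
    top = sorted(activations, key=activations.get, reverse=True)[:level]
    return [t for t in top if t not in query_terms]


# Hypothetical activations deposited by a relevant set.
acts = {"oil": 0.9, "spill": 0.8, "tanker": 0.6, "the": 0.1}
print(update_r(0.2, acts["oil"], eta_q=0.5))
print(expansion_terms({"oil", "spill"}, acts, level=3))
```

A learning rate of 0 leaves the self-learn value untouched and a rate of 1 replaces it with the activation, so the rate controls how far training on known relevants overrides the initial weight.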


6.  Retrieval  Methodology 

To satisfy TREC2 requirements, we submitted results named as follows:

pircs1: routing, no training;
pircs2: routing, with training from Disk1 relevants, Disk2 not used
        and no query expansion;
pircs3: ad hoc, no soft-Boolean added;
pircs4: ad hoc, with soft-Boolean added.

Routing allows training with the old Q2 topic set (topics 51-100)
before doing retrieval on new Disk3 documents. Disk3 term usage
statistics are not used. Ad hoc retrieval involves using the new Q3
topic set (topics 101-150) to do retrieval on old documents in Disk1
and Disk2. All our queries are automatically constructed. We did not
perform feedback experiments using Q3.

For baseline routing (pircs1) and ad hoc (pircs3) retrievals, we use
the item self-learn (SL) edge weights. Routing pircs2 denotes
retrieval based on further learning from known relevant samples and
represents an improvement over pircs1. There can be hundreds of known
relevants for each topic of the Q2 set from documents in Disk1 and
Disk2, as given by the results of TREC1. One way of employing them is
to do a retrieval (ranking) of Disk1 and Disk2 documents, and then
make use of the first n (say n=100) relevant documents, as for
feedback learning. However, we did not have enough resources to create
a network and do retrieval (ranking) on 2 GB of documents at that time
and had to settle on a simplified strategy. First, we decided to use
Disk1 only (1 GB) for training. Secondly, we believe a retrieval
(ranking) of Disk1 is still expensive and perhaps not necessary;
rather we just select 'nonbreak' relevants from Disk1 for training.
'Nonbreak' means documents that do not get split into multiple
subdocuments based on our criteria given in Section 3. The idea is
that the quality of documents for training is important, and short
relevants are the choice. They may not be those ranked early during a
retrieval. With these simplifications, a network is produced with
Disk1, the query-term edges are trained, and then stored for later
routing retrieval using Disk3.

Ad hoc pircs4 denotes retrieval based on combining the baseline
pircs3 with a soft-Boolean retrieval. The pircs4 ranking formula
becomes $r \cdot W_i + s \cdot S_i$ (see definitions in Section 2).
Our Boolean expressions for queries are produced automatically as
discussed in Section 4.3, and edge weights are used to initiate the
leaf nodes of the Boolean expression tree.
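The linear combination used for pircs4 can be sketched as below. The sketch assumes that W_i is the baseline retrieval status value of document i and S_i its soft-Boolean value; the function names and the example coefficients are hypothetical, since the values of r and s used in the experiments are not given in this section.

```python
from typing import Dict, List, Tuple


def combined_rsv(w_i: float, s_i: float, r: float, s: float) -> float:
    """Ranking value r*W_i + s*S_i for one document."""
    return r * w_i + s * s_i


def rank_documents(scores: Dict[str, Tuple[float, float]],
                   r: float, s: float) -> List[str]:
    """scores maps doc id -> (W_i, S_i); returns ids by decreasing value."""
    return sorted(scores,
                  key=lambda d: combined_rsv(scores[d][0], scores[d][1], r, s),
                  reverse=True)


# Hypothetical scores: d2 only overtakes d1 once the Boolean part counts.
scores = {"d1": (0.9, 0.1), "d2": (0.5, 0.9)}
print(rank_documents(scores, r=1.0, s=0.0))
print(rank_documents(scores, r=1.0, s=1.0))
```

Setting s to zero reproduces the baseline ranking, which makes it easy to see how much of a change in ranking is due to the soft-Boolean component.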

7.  Discussion  of  Submitted  Results 

From our routing retrieval table in the master appendix of this
volume, it can be seen that pircs2 improves over pircs1 by about 7%
based on average non-interpolated precision (.266 vs .249) and about
3.8% based on relevants retrieved (6135 vs 5913), showing that our
simplified method of using only the Disk1 'nonbreak' training
documents still works. We did not do a retrieval and rank. Compared
with the other sites, our result is below median both using the
average non-interpolated precision for individual queries (18 better,
2 equal and 30 below median), and using the relevants retrieved at 100
documents (18 better, 8 equal and 24 below median). If we assume the
existence of an overall 'maxi-system' that produces the best
non-interpolated precision values among all sites for all 50 queries,
then its average precision over all queries is 0.5054, with 8348
relevants retrieved. Our pircs2 achieves only .266/.505 = 52.7% of the
average precision but 6135/8348 = 73.5% of the relevants retrieved.
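The 'maxi-system' comparison can be reproduced with a few lines. The sketch below assumes per-query precision values are available for every site as equal-length lists; the data shown are made up for illustration and are not TREC-2 results.

```python
from typing import Dict, List


def maxi_system_average(per_site: Dict[str, List[float]]) -> float:
    """Average, over queries, of the best per-query value among all sites."""
    n_queries = len(next(iter(per_site.values())))
    best = [max(values[q] for values in per_site.values())
            for q in range(n_queries)]
    return sum(best) / n_queries


def fraction_of_maxi(system_average: float, maxi_average: float) -> float:
    """Share of the maxi-system average that one system achieves."""
    return system_average / maxi_average


# Made-up per-query precision for two sites and two queries.
sites = {"siteA": [0.2, 0.6], "siteB": [0.4, 0.1]}
print(maxi_system_average(sites))

# The routing figures quoted in the text.
print(round(100 * fraction_of_maxi(0.266, 0.505), 1))
print(round(100 * fraction_of_maxi(6135, 8348), 1))
```

Because the maximum is taken query by query, no single site need reach the maxi-system average; it is a ceiling built from the best run on each topic.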

From the ad hoc retrieval table in the appendix of this volume it can
be seen that pircs4, which is pircs3 combined with automatic
soft-Boolean retrieval, improves over pircs3 only by about 1%.
Processing time however increases substantially. Our automatic Boolean
expressions are crudely formed; manual Boolean queries may do better.
Compared with other sites, our result is above median both using the
average non-interpolated precision for individual queries (34 better,
2 equal and 14 below median), and using the relevants retrieved at 100
documents (36 better, 4 equal and 10 below median). The 'maxi-system'
has an average precision over all queries of 0.4354 and 9027 relevants
retrieved. pircs4 achieves about 0.298/0.435 = 68.5% of this best
precision value and 7464/9027 = 82.7% of the relevants retrieved.
These figures are much better than those for routing. It would be most
useful and interesting if one could choose the best reported result
for each query before the answers are known. For these experiments our
high frequency term cut-off is 16000, which is still too low. The next
Section discusses our later results.


8.  Further  Experimental  Results 

After the TREC2 Conference, we decided to repeat both experiments. We
realize that our disappointing results are due to several factors:
1) bad high frequency term cut-off leading to insufficient
representation; 2) no query expansion; 3) insufficient training
samples; and 4) parameters need tuning. Except for 4) these are
remedied as follows: the high-frequency cut-off is set at 50000,
learning for routing is done from both Disk1 and Disk2 and only
documents that 'break' into six or fewer subdocuments are used, and
query expansion is also done. The runs are named in Table 2 as:

pircs5: routing, with learning but no query expansion;
pircs6: routing, query expansion level of 20;
pircs7: routing, 'upperbound', no expansion;
pircs8: ad hoc without Boolean queries.

                                Revised Routing             New Ad hoc
                        pircs5  pircs6      %  pircs7      %    pircs8

Total number of documents over all 50 queries
Retrieved:               50000   50000          50000            50000
Relevant:                10489   10489          10489            10785
Rel_ret:                  7098    7476   +5.3    7385   +4.0      8279

Interpolated recall - precision averages:
0.0                       .765    .814   +6.4    .793   +3.7      .823
0.1                       .554    .628  +13.3    .594   +7.2      .561
0.2                       .491    .546  +11.2    .534   +8.6      .505
0.3                       .421    .469  +11.4    .469  +11.4      .456
0.4                       .380    .413   +8.7    .416   +9.5      .417
0.5                       .336    .363   +8.0    .368   +9.5      .368
0.6                       .270    .310  +14.8    .314  +16.3      .320
0.7                       .212    .240  +13.2    .241  +13.7      .246
0.8                       .150    .165  +10.0    .171  +14.0      .160
0.9                       .079    .099  +25.3    .098  +24.0      .087
1.0                       .014    .006  -57.1    .015   +7.1      .012

Average precision (non-interpolated) over all rel docs
                          .318    .355  +11.6    .350  +10.1      .344

Precision at
5 docs:                   .600    .660  +10.0    .624   +4.0      .612
10 docs:                  .582    .632   +8.6    .598   +2.7      .572
15 docs:                  .545    .617  +13.2    .572   +5.0      .573
20 docs:                  .527    .583  +10.6    .563   +6.8      .564
30 docs:                  .507    .553   +9.1    .534   +5.3      .540
100 docs:                 .402    .439   +9.2    .427   +6.2      .468
200 docs:                 .334    .360   +7.8    .353   +5.7      .396
500 docs:                 .222    .238   +7.2    .232   +4.5      .264
1000 docs:                .142    .150   +5.6    .148   +4.2      .166

R-precision
Exact                     .358    .385   +7.5    .385   +7.5      .378

Table 2: Revised Routing and New Ad Hoc Retrieval Results

As in TREC1, our query expansion level of 20 actually adds fewer than
20 terms because some of the top-ranked terms may already appear in
the query. It can be seen that results are substantially better than
those in Section 7. In particular, pircs6 routing with query expansion
has an average precision of 0.355 and retrieves 7476 of the 10489
relevants. These are 12% and 5% better, respectively, than pircs5
(0.318, 7098), routing with learning but no query expansion, and
achieve 70.3% and 89.6% of the maxi-system values. The corresponding
average precision and relevants retrieved for ad hoc retrieval pircs8
are 0.344 and 8279 out of 10785, representing 79% and 91.7% of the ad
hoc maxi-system respectively. At 20 documents retrieved, the precision
values for routing and ad hoc are respectively 0.583 and 0.564. This
means that, averaging over 50 queries, over 11 of the first 20
retrieved are relevant. Considering the size of these textbases, these
are quite good results. These numbers are user-oriented, and users
naturally hope to see 100% precision. As discussed in TREC1, from a
system point of view the precision at n documents retrieved should not
be compared to the theoretical value of 1.0, but to an operational
precision value x/n if the total number of relevants x for a query is
less than n. For example, at n=100 documents retrieved 20 routing and
16 ad hoc queries have total relevants x less than 100. The
operational maximum precision averaged over 50 queries for routing is
only 0.8, and that for ad hoc is 0.871. At 100 documents, the routing
pircs6 value of 0.439 and the ad hoc pircs8 value of 0.468 therefore
achieve 54.9% and 53.7% of the operational maximum. The R-precision
Exact calculates the precision value at x retrieved, where x is the
known number of relevants for each query, and can be compared with the
theoretical value of 1.0.
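The operational maximum precision and the exact R-precision discussed above can be written down directly. The following is a minimal sketch; the function names and the toy inputs are illustrative, and the per-query relevant counts of the actual topic sets are not reproduced here.

```python
from typing import List, Sequence


def operational_max_precision(total_relevant: Sequence[int], n: int) -> float:
    """Best achievable precision at n documents, averaged over queries.

    A query with x < n relevants can reach at most x/n; otherwise 1.0.
    """
    return sum(min(x, n) / n for x in total_relevant) / len(total_relevant)


def r_precision(ranked_relevance: List[bool], x: int) -> float:
    """Precision after x documents, where x is the query's relevant count."""
    return sum(ranked_relevance[:x]) / x


# Two toy queries with 50 and 200 known relevants, cutoff n = 100.
print(operational_max_precision([50, 200], 100))

# A toy ranking for a query with 3 relevants: two of the top 3 are relevant.
print(r_precision([True, False, True, True], 3))

# Ratios quoted in the text for the 100-document cutoff.
print(round(100 * 0.439 / 0.8, 1), round(100 * 0.468 / 0.871, 1))
```

R-precision needs no such correction because its cutoff already equals the number of relevants, so a perfect ranking always scores 1.0.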

The 'Upperbound' retrieval pircs7 (suggested by Sparck-Jones) means
performing learning from the known Disk3 (not Disk 1&2) relevant
documents before retrieval. In other words, we assume the answer
documents are known for training, and the run represents the best that
probabilistic theory can provide using our system. This is however not
the true upperbound [Spar79] for retrieval from Disk3, because the
vocabulary and usage statistics are still those of Disk 1&2. The
vocabulary is retained for comparison with routing results. pircs7
retrieval achieves average precision of 0.350, which improves over
pircs5 (training from Disk 1&2) by about 10% in average precision and
about 4% in relevants retrieved. Of course in real life, the answer
documents are not known; but it is interesting to note that query
expansion using Disk 1&2 documents can provide similar performance,
showing the importance of query expansion.

We later concentrated on routing and discovered that additional gains
can be achieved by fine tuning of the parameters in our model. For
learning: 1) we find that our original method of using only 'nonbreak'
documents in the given set of relevants actually outperforms other
document selection strategies, including using all relevants, 'break
six', or ranking and selecting the n best. Moreover, these 'nonbreak'
documents total only 5225, less than 1/3 of the 16114 relevants used,
and the method is therefore very efficient. (There are actually 16400
relevants from Disk 1&2, but during processing a small percentage was
lost.) 2) All edges and their weights on the query side of the network
are defined by the activations deposited by the relevant documents;
this means the original query plays no part in their definition.
3) Negative edge weights are set to small positive weights of 0.1. For
retrieval: 4) after ranking, several subdocuments of the same document
ID may rank high, and we combine their largest three RSVs in the ratio
of 1:0.2:0.05 as the single reported RSV for the whole document.
Previously we ignored the third, and the ratio for combining the
largest two was different. We choose to stop at two or three
subdocuments because noise from long documents may creep back. Such
tuning of parameters led to the results in Table 3 for our latest
routing results. We use the convention 'Trained Exp K' to denote query
expansion level K, with K=0 meaning weight adaptation without adding
new terms.

New Routing Results

                     No Trained         Trained         Trained         Trained         Trained
                 Train.   Exp 0      %  Exp 20      %  Exp 40      %  Exp 80      % Exp 100      %

Total number of documents over all 50 queries
Retrieved:        50000   50000          50000          50000          50000          50000
Relevant:         10489   10489          10489          10489          10489          10489
Rel_ret:           6551    6961   +6.0    7496  +14.0    7646  +17.0    7695  +17.0    7712  +18.0

Interpolated Recall - Precision Averages:
0.00             0.7200  0.7480   +4.0  0.8471  +18.0  0.8475  +18.0  0.8646  +20.0  0.8660  +20.0
0.10             0.5124  0.5815  +13.0  0.6645  +30.0  0.6751  +32.0  0.6810  +33.0  0.6801  +33.0
0.20             0.4431  0.5254  +19.0  0.5981  +35.0  0.6116  +38.0  0.6135  +38.0  0.6115  +38.0
0.30             0.4016  0.4728  +18.0  0.5371  +34.0  0.5413  +35.0  0.5465  +36.0  0.5452  +36.0
0.40             0.3486  0.4402  +26.0  0.4751  +36.0  0.4774  +37.0  0.4829  +39.0  0.4878  +40.0
0.50             0.2970  0.3862  +30.0  0.4167  +40.0  0.4288  +44.0  0.4229  +42.0  0.4214  +42.0
0.60             0.2382  0.3048  +28.0  0.3496  +47.0  0.3681  +55.0  0.3699  +55.0  0.3690  +55.0
0.70             0.1945  0.2430  +25.0  0.2772  +43.0  0.2880  +48.0  0.2815  +45.0  0.2843  +46.0
0.80             0.1284  0.1865  +45.0  0.1911  +49.0  0.1937  +51.0  0.1999  +56.0  0.2005  +56.0
0.90             0.0740  0.0860  +16.0  0.1130  +53.0  0.1144  +55.0  0.1219  +65.0  0.1238  +67.0
1.00             0.0119  0.0187  +57.0  0.0140  +18.0  0.0107  -10.0  0.0114   -4.0  0.0171  +44.0

Average precision (non-interpolated) over all rel docs
                 0.2905  0.3517  +21.0  0.3962  +36.0  0.4050  +39.0  0.4084  +41.0  0.4095  +41.0

Precision at
5 docs:          0.5600  0.5760   +3.0  0.6960  +24.0  0.7160  +28.0  0.7320  +31.0  0.7280  +30.0
10 docs:         0.5440  0.5820   +7.0  0.6880  +26.0  0.6860  +26.0  0.6980  +28.0  0.7000  +29.0
15 docs:         0.5173  0.5627   +9.0  0.6573  +27.0  0.6707  +30.0  0.6800  +31.0  0.6813  +32.0
20 docs:         0.4910  0.5510  +12.0  0.6470  +32.0  0.6540  +33.0  0.6630  +35.0  0.6610  +35.0
30 docs:         0.4653  0.5313  +14.0  0.6147  +32.0  0.6173  +33.0  0.6240  +34.0  0.6267  +35.0
100 docs:        0.3698  0.4396  +19.0  0.4824  +30.0  0.4930  +33.0  0.4974  +35.0  0.5002  +35.0
200 docs:        0.3049  0.3562  +17.0  0.3887  +27.0  0.3945  +29.0  0.4002  +31.0  0.4004  +31.0
500 docs:        0.2038  0.2241  +10.0  0.2452  +20.0  0.2490  +22.0  0.2500  +23.0  0.2498  +23.0
1000 docs:       0.1310  0.1392   +6.0  0.1499  +14.0  0.1529  +17.0  0.1539  +17.0  0.1542  +18.0

R-Precision (precision after R (= num_rel for a query) docs retrieved):
Exact:           0.3346  0.3942  +18.0  0.4251  +27.0  0.4283  +28.0  0.4281  +28.0  0.4291  +28.0

Table 3: New Routing Results at Several Query Expansion Levels

The 'No Train.' column shows results without using any known
relevants for training and serves as the basis for comparison. It can
be seen, using the measure of average precision over all recall
points, that training without query expansion improves over no
training by 21%, and training with expansion at the 40 level improves
over the basis by about 39%. This measure, as well as the R-precision,
seems to level off from expansion level 40 onwards. However, the
number of relevants retrieved improves monotonically from 6551 (no
training) to 7712 (expansion level 100). A higher query expansion
level appears to improve the high-recall region of the
precision-recall curve without materially affecting the low-recall
region, as observed in [Kwok9x] using the WSJ collection only.
Precision values at the different cutoffs of documents retrieved seem
to level off at the expansion level of 80. At the 20 retrieved
documents cutoff, we now achieve a precision over 0.65, meaning that
more than 13 of the 20 documents retrieved are relevant on the
average. The tuning of parameters gives us over 10% additional
improvement above that obtained in the revised routing results of
Table 2. It appears that a query expansion level of 40 achieves a
compromise between good effectiveness and good efficiency for our
system. We did not do massive query expansion at high levels of 200 or
more. However, the results are comparable to the best of those
reported in the TREC2 conference.
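The way subdocument scores are folded into one document score can be sketched as follows. The sketch assumes that "combining in the ratio of 1:0.2:0.05" means a weighted sum of the three largest subdocument RSVs; that reading, the function name, and the (doc_id, rsv) input format are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Sequence, Tuple


def combine_subdocument_rsvs(
        subdoc_rsvs: Iterable[Tuple[str, float]],
        ratios: Sequence[float] = (1.0, 0.2, 0.05)) -> Dict[str, float]:
    """Fold (doc_id, rsv) pairs of subdocuments into one RSV per document.

    The largest RSVs of each document are weighted by `ratios` and summed;
    subdocuments beyond len(ratios) are ignored.
    """
    by_doc = defaultdict(list)
    for doc_id, rsv in subdoc_rsvs:
        by_doc[doc_id].append(rsv)
    combined = {}
    for doc_id, rsvs in by_doc.items():
        top = sorted(rsvs, reverse=True)[:len(ratios)]
        combined[doc_id] = sum(w * v for w, v in zip(ratios, top))
    return combined


# Hypothetical subdocument scores: d1 was split into four parts, d2 into one.
pairs = [("d1", 2.0), ("d1", 1.0), ("d1", 4.0), ("d1", 0.5), ("d2", 3.0)]
print(combine_subdocument_rsvs(pairs))
```

Passing `ratios=(1.0, 0.2)` would give a two-subdocument variant, and a single ratio of 1.0 reduces to scoring a document by its best subdocument.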


9.  Conclusion 

We have upgraded our PIRCS system to use dynamic network creation for
learning and retrieval, and to handle files in a master-subcollection
design. The former approach allows us to eliminate full inverted file
creation, resulting in a space requirement of 2 x collection size,
reduced 'dead' time before a collection becomes searchable, and fast
learning. The latter approach renders our system sufficiently flexible
to handle a large number of files in a robust fashion, yet produce a
ranked retrieval list as if all documents were in one file. Although
our submitted results for TREC2 were not up to expectation because of
insufficient resources at the time of the experiments, the reasons for
the behavior of our system were isolated. New experiments show that
PIRCS can provide highly competitive retrieval effectiveness in both
ad hoc and routing environments.


Combination of Multiple Searches


Edward  A.  Fox  and  Joseph  A.  Shaw 
Department  of  Computer  Science 
Virginia  Tech,  Blacksburg,  VA  24061-0106 


Abstract 

The TREC-2 project at Virginia Tech focused on methods for combining
the evidence from multiple retrieval runs to improve retrieval
performance over any single retrieval method. This paper describes one
such method that has been shown to increase performance by combining
the similarity values from five different retrieval runs using both
vector space and P-norm extended boolean retrieval methods.

1  Overview 


Table 1: SMART weighting schemes used for TREC-2.

SMART label    term_weight =
ann            0.5 + 0.5 * tf / max-tf
bnn            1
mnn            tf / max-tf
atn            (0.5 + tf / (2 * max-tf)) * log(num-docs / coll-freq)
nnn            tf

The primary focus of our experiments at Virginia Tech involved
methods of combining the results from various divergent search schemes
and document collections. We performed both routing and ad-hoc
retrieval experiments on the provided test collections. The results
from both vector and P-norm type queries were considered in
determining the probability of relevance for each document in an
individual collection. The results for each collection were then
merged to create a single final set of documents that would be
presented to the user.

2    Index  Creation 

This section outlines the indexing done with the document collection
provided by NIST. Each of the individual collections was indexed
separately as document vector files; limitations in disk space
prohibited the use of inverted files and the creation of a single
combined document vector file.

All processing was performed on a DECstation 5000/25 with 40 MB of
RAM using the 1985 release of the SMART Information Retrieval System
[2], with enhancements from previous experiments as well as a new
modification for our TREC-2 experiments.

The index files were created from the source text via the following
process. First, the source document text provided by NIST was passed
through a preparser to convert the SGML-like format to the proper
format for the 1985 version of SMART. The extraneous sections of the
documents were filtered out at this point. The TEXT sections of the
documents, as well as the various HEADLINE, TITLE, SUMMARY, and
ABSTRACT sections of the collections were indexed; all of the other
sections were ignored. The subsections of the TEXT fields, where they
existed, were considered as part of the TEXT field, with the
subsection delimiters removed.

The resulting filtered text was tokenized, stop words were deleted
using the standard 418 word stop list provided with SMART, and the
remaining non-noise words were included in the term dictionary along
with their occurrence frequencies. Each term in the dictionary has a
unique identification number. A document vector file was created
during indexing which contains for each document its unique ID, and a
vector of term IDs and term weights. The initially recorded weights
can be changed based on one of several schemes after the indexing is
complete. The various SMART weighting schemes referred to within this
paper are summarized in Table 1. The dictionary size for each
collection was approximately 16 MB, while the document vector files
ranged from 31 MB to 124 MB (see Table 2).

Table 2: Collection statistics summary. Text, Dictionary and Document
Vector sizes in Megabytes.

Collection          Text   Dictionary   Doc. Vectors    Total Docs
AP-1                 266         16.0          120.2         84678
DOE-1                190         15.9           97.9        226087
FR-1                 258         15.8           53.8         26207
WSJ-1        [illegible]         16.2          124.8   [illegible]
ZIFF-1               251         15.7           88.4         75180
D1                  1260          N/A          485.1        510887
AP-2                 248         15.9          110.4         79923
FR-2                 211         15.6           42.7         20108
WSJ-2                255         16.0          105.5         74520
ZIFF-2               188         15.4           63.6         56920
D2                   902          N/A          322.2        231471
D1 & D2             2162          N/A          807.3        742358
AP-3                 250         15.9          111.2         78325
PATN-3               254         15.6           31.3          6711
SJM-3                319         16.1          114.4         90257
ZIFF-3               362         16.0          109.8        161021
D3                  1185          N/A          366.7        336314
Total               3347          N/A         1174.0       1078672
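The weighting schemes named in Table 1 can be expressed as small functions. This is a sketch of the conventional SMART definitions (augmented, binary and max-normalized term frequency, plus an inverse document frequency factor for atn); the function signatures are illustrative, and the use of `num_docs / coll_freq` inside the logarithm follows the usual SMART idf convention rather than anything stated explicitly in this section.

```python
import math


def ann(tf: float, max_tf: float) -> float:
    """Augmented tf: 0.5 + 0.5 * tf / max_tf, no collection statistics."""
    return 0.5 + 0.5 * tf / max_tf


def bnn(tf: float, max_tf: float) -> float:
    """Binary weight: 1 for every term present in the document."""
    return 1.0


def mnn(tf: float, max_tf: float) -> float:
    """tf normalized by the largest tf in the document."""
    return tf / max_tf


def nnn(tf: float, max_tf: float) -> float:
    """Raw term frequency."""
    return float(tf)


def atn(tf: float, max_tf: float, num_docs: int, coll_freq: int) -> float:
    """Augmented tf times an idf factor (assumed log(num_docs / coll_freq))."""
    return ann(tf, max_tf) * math.log(num_docs / coll_freq)


# A term occurring twice in a document whose most frequent term occurs 4 times.
print(ann(2, 4), bnn(2, 4), mnn(2, 4), nnn(2, 4))
print(atn(2, 4, num_docs=1000, coll_freq=10))
```

Only atn depends on collection-wide counts, so ann, bnn, mnn and nnn yield weights that are directly comparable between separately indexed collections.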

3  Retrieval 

3.1  Queries 

All of the queries were created from the topic descriptions provided
by NIST. Two types of queries were used, P-norm extended boolean
queries and natural language vector queries. A single set of P-norm
queries was created, but was interpreted multiple times with different
operator weights (P-values). Two different sets of vector queries were
created from the topics, one containing information from fewer
sections of a topic description. The Title, Description and Concepts
sections of the topic descriptions were used in the creation of all
three sets of queries, the Definitions section was used also in both
sets of vector queries, while the P-norm query set and one of the
vector query sets also contained information from the Narrative
section of the topic descriptions. The vector query set that included
the Narrative section of the topic is referred to as the long vector
query set, for obvious reasons, while the other is referred to as the
short vector query set.

The P-norm queries were written as complex boolean expressions using
AND and OR operators. Phrases were simulated using AND operators since
the queries were intended only for soft-boolean evaluation. The query
terms were not specifically weighted; uniform operator weights
(P-values) of 1.0, 1.5 and 2.0 were used on different evaluations of
the query set.

Table 4: Summary of the five individual runs.

Title    Query Type      Similarity Measure
SV       Short vector    Cosine similarity
LV       Long vector     Cosine similarity
Pn1.0    P-norm          P-norm, P = 1.0
Pn1.5    P-norm          P-norm, P = 1.5
Pn2.0    P-norm          P-norm, P = 2.0
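For readers unfamiliar with P-norm evaluation, the effect of the P-value can be seen in a short sketch. It uses the standard P-norm operator definitions with uniform (unit) query term weights, which matches the statement that terms were not specifically weighted; the function names and example values are illustrative and are not taken from the SMART implementation.

```python
from typing import Sequence


def pnorm_or(values: Sequence[float], p: float) -> float:
    """P-norm OR over document term weights in [0, 1], unit query weights."""
    n = len(values)
    return (sum(v ** p for v in values) / n) ** (1.0 / p)


def pnorm_and(values: Sequence[float], p: float) -> float:
    """P-norm AND over document term weights in [0, 1], unit query weights."""
    n = len(values)
    return 1.0 - (sum((1.0 - v) ** p for v in values) / n) ** (1.0 / p)


# A document that contains one of two query terms fully and lacks the other.
weights = [1.0, 0.0]
for p in (1.0, 1.5, 2.0):
    print(p, round(pnorm_or(weights, p), 4), round(pnorm_and(weights, p), 4))
```

At P = 1.0 the AND and OR operators give the same value (the mean of the term weights), so the boolean structure has no effect; as P grows, OR moves toward the maximum and AND toward the minimum, approaching strict boolean behaviour.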

3.2    Individual  Retrieval  Runs 

The first step in our TREC-2 experiments involved determination of
what weighting schemes would be most effective for P-norm queries. Our
TREC-1 experiments with P-norm queries had obtained mixed results,
performing poorly based on binary document term weights in our Phase I
experiments and performing well for a P-value of 1.0 and very poorly
with larger P-values in our Phase II experiments using a tf-idf
weighting scheme [4]. We performed several P-norm retrieval runs on
the two AP and two WSJ training collections with topics 51 to 100 to
determine the most effective term weighting scheme for P-norm queries
with large test collections. The results from these experiments are
shown in Table 3 using the standard TREC-2 average non-interpolated
precision and the exact R-precision measures. The most effective
weighting scheme turned out to be the SMART ann weighting scheme,
which confirmed the result obtained originally by Fox for the much
smaller classical document collections [3].

The two sets of vector queries were evaluated using the standard
cosine correlation similarity method as implemented by SMART. The same
SMART ann weighting scheme used for the P-norm queries was used on the
vector queries for several reasons. First, a weighting scheme that did
not use any collection statistics was needed for the routing
experiments. Second, the methods used in combining runs described in
the next section required a similar range of possible similarity
values produced by each run. Finally, the necessity of merging results
from each collection into a single set of results was simplified since
the resulting similarity values were not based on collection
statistics which would have differed for each collection. The P-norm
queries were evaluated using three different P-values, again using the
SMART ann weighting scheme based on specific P-norm experiments
described below. The five individual runs are summarized in Table 4.
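The cosine correlation used for the vector runs can be sketched over sparse term-weight vectors. This is a generic illustration, not the SMART code; the dictionary representation and the function name are assumptions.

```python
import math
from typing import Dict


def cosine_similarity(query: Dict[str, float], doc: Dict[str, float]) -> float:
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(w * doc.get(term, 0.0) for term, w in query.items())
    norm_q = math.sqrt(sum(w * w for w in query.values()))
    norm_d = math.sqrt(sum(w * w for w in doc.values()))
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0
    return dot / (norm_q * norm_d)


# Hypothetical ann-style weights (all between 0.5 and 1.0 for present terms).
q = {"merger": 1.0, "bank": 0.75}
d = {"bank": 1.0, "loan": 0.5, "merger": 0.75}
print(round(cosine_similarity(q, d), 4))
```

With non-negative weights the result always lies between 0 and 1, regardless of collection, which is the property the text relies on when similarity values from different runs and collections are later combined.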

The  five  individual  runs  were  performed  and  evalu- 
ated for  each  of  the  nine  training  collections  on  topics 
51  to  100.  The  results  for  these  experiments  are  given  in 
Table 3: Average Precision and Exact R-Precision for P-norm experiments on weighting with the AP and WSJ
collections (Ad-hoc Topics 51-100).

                  Average Precision         R-Precision
Coll.   P-value   ann     bnn     mnn       ann     bnn     mnn
AP-1    1.0       ------  ------  ------    ------  ------  ------
AP-1    1.5       0.3122  0.2581  0.1444    0.2976  0.2732  0.1757
AP-1    2.0       ------  ------  ------    ------  ------  ------
AP-2    1.0       ------  ------  ------    ------  ------  ------
AP-2    1.5       0.3332  0.2999  0.1831    0.3412  0.3118  0.2161
AP-2    2.0       0.3300  0.2922  0.1847    0.3339  0.3057  0.2284
WSJ-1   1.0       0.2941  0.2485  0.1742    0.3221  0.2830  0.2181
WSJ-1   1.5       0.3199  0.2753  0.1774    0.3443  0.2994  0.2225
WSJ-1   2.0       0.3217  0.2752  0.1776    0.3470  0.3013  0.2277
WSJ-2   1.0       0.2206  0.1881  0.1356    0.2367  0.2094  0.1722
WSJ-2   1.5       0.2327  0.2013  0.1174    0.2511  0.2234  0.1549
WSJ-2   2.0       0.2325  0.1970  0.1098    0.2442  0.2158  0.1445

(------ marks cells that are illegible in the source scan. The AP-2 row label is also illegible and is inferred from the matching AP Disk 2 values in Table 5.)

Table  5.  In  general,  the  P-norm  queries  performed  bet- 
ter than  the  vector  queries.  The  most  effective  P-value 
however  differed  between  the  collections:  The  AP  runs 
performed  better  with  a  P-value  of  1.5,  while  a  P-value 
of  2.0  performed  better  for  the  WSJ  collections. 
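
The role of the P-value can be made concrete with a small sketch. It assumes the standard extended boolean (P-norm) formulation of Salton, Fox and Wu, with query term weights `a_i` and document term weights `d_i` in [0, 1]; it is not the SMART code used for these runs, and the function names are illustrative only.

```python
def pnorm_or(query_weights, doc_weights, p):
    """P-norm similarity of a document to an OR clause."""
    num = sum((a ** p) * (d ** p) for a, d in zip(query_weights, doc_weights))
    den = sum(a ** p for a in query_weights)
    return (num / den) ** (1.0 / p)


def pnorm_and(query_weights, doc_weights, p):
    """P-norm similarity of a document to an AND clause."""
    num = sum((a ** p) * ((1.0 - d) ** p)
              for a, d in zip(query_weights, doc_weights))
    den = sum(a ** p for a in query_weights)
    return 1.0 - (num / den) ** (1.0 / p)
```

With p = 1.0 the AND and OR operators give the same value (a weighted average of the document term weights), so the query behaves like a vector query. As p grows the operators are interpreted more strictly, with OR approaching the maximum and AND approaching the minimum of the term weights.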

3.3    Combination  Retrieval  Runs 

Our  experiments  in  TREC-1  involved  combining  the 
results  from  several  different  retrieval  runs  for  a  given 
collection, either by simply taking the top N documents
retrieved for each run, or by modifying the value of N for
each  run,  based  on  the  eleven  point  average  precision 
for  that  run.  We  felt  these  efforts  suffered  from  con- 
sidering only  the  rank  of  a  retrieved  document  and  not 
the  actual  similarity  value  itself.  In  TREC-2,  our  ex- 
periments concentrated  on  methods  of  combining  runs 
based  on  the  similarity  values  of  a  document  to  each 
query  for  each  of  the  runs.  Additionally,  combining  the 
similarities  at  retrieval  time  had  the  advantage  of  extra 
evidence  over  combining  separate  results  files  since  the 
similarity  of  every  document  for  each  run  was  available 
instead  of  just  the  similarities  for  the  top  1000  docu- 
ments for  each  run.  While  our  results  for  four  of  the 
training  collections  indicated  that  the  P-norm  queries 
performed  better  than  the  vector  queries,  this  result 
was  likely  specific  to  the  actual  queries  involved  and 
not necessarily true in general. This led to a decision
to weight each of the separate runs equally and not favor
any individual run or method. In general, it may
be  desirable  or  necessary  to  weight  a  single  run  more, 
or  less,  depending  on  its  overall  performance;  this  could 
be  especially  useful  in  a  routing  situation. 

For any given information retrieval ranking method,
there  are  two  primary  types  of  errors  that  can  occur: 


Table 6: Formulas for combining similarity values.

Name       Combined Similarity =
CombMAX    MAX(Individual Similarities)
CombMIN    MIN(Individual Similarities)
CombSUM    SUM(Individual Similarities)
CombANZ    SUM(Individual Similarities) / Number of Nonzero Similarities
CombMNZ    SUM(Individual Similarities) * Number of Nonzero Similarities
CombMED    MED(Individual Similarities)

assigning  a  relatively  high  rank  to  a  non-relevant  docu- 
ment, and  assigning  a  relatively  low  rank  to  a  relevant 
document.  It  has  been  shown  that  different  retrieval 
paradigms  will  perform  differently  on  the  same  set  of 
data, often with little overlap in the set of retrieved documents [5].
For instance, when one retrieval method
assigns  a  high  rank  to  a  non-relevant  document,  a  differ- 
ent retrieval  method  is  likely  to  assign  that  document  a 
much  lower  rank.  Similarly,  when  one  retrieval  method 
fails  to  assign  a  high  rank  to  a  relevant  document,  a 
different  retrieval  method  is  likely  to  assign  that  doc- 
ument a  high  rank.  This  characteristic  of  information 
retrieval  methods  indicates  that  some  method  for  con- 
sidering both  retrieval  methods  together  should  help  to 
decrease the probability of either error occurring; of course, it
is  also  possible  for  both  methods  to  highly  rank  a  non- 
relevant  document  or  to  poorly  rank  a  relevant  docu- 
ment. 

Six  methods  of  combining  the  similarity  values  were 
tested  in  our  TREC-2  experiments,  as  summarized  in 
Table 5: Average Precision and Exact R-Precision for the five individual runs (Ad-hoc Topics 51-100).

Average non-interpolated Precision

          Disk 1                                       Disk 2                              Both
Run       AP       DOE      FR       WSJ      ZF       AP       FR       WSJ      ZF       Disks
SV        0.2387   0.0605   0.0222   0.2203   0.1026   0.2543   0.0330   0.1503   0.0770   0.1418
LV        0.2435   0.0586   0.0302   0.2414   0.0864   0.2664   0.0324   0.1633   0.0753   0.1555
Pn1.0     0.2605   0.0658   0.0611   0.2941   0.1110   0.3004   0.0879   0.2206   0.1003   0.1988
Pn1.5     0.2939   0.0771   0.0639   0.3199   0.1278   0.3332   0.0878   0.2327   0.1065   0.2242
Pn2.0     0.2849   0.0847   0.0706   0.3217   0.1278   0.3300   0.0865   0.2325   0.1136   0.2250
CombSUM   0.3493   0.1001   0.0741   0.3605   0.1475   0.3748   0.0842   0.2752   0.1273   0.2620
Chg/Max   18.84%   18.18%   4.95%    12.06%   15.41%   12.48%   -4.20%   18.26%   12.05%   16.44%

Exact R-Precision

          Disk 1                                       Disk 2                              Both
Run       AP       DOE      FR       WSJ      ZF       AP       FR       WSJ      ZF       Disks
SV        0.2624   0.0564   0.0183   0.2616   0.1180   0.2649   0.0202   0.1744   0.0922   0.2169
LV        0.2672   0.0493   0.0274   0.2800   0.0802   0.2704   0.0176   0.1860   0.0843   0.2311
Pn1.0     0.2688   0.0661   0.0533   0.3221   0.1123   0.3165   0.0971   0.2367   0.0969   0.2708
Pn1.5     0.2976   0.0762   0.0572   0.3443   0.1218   0.3412   0.1016   0.2511   0.1068   0.2962
Pn2.0     0.2968   0.0765   0.0654   0.3470   0.1254   0.3339   0.0820   0.2442   0.1158   0.3008
CombSUM   0.3590   0.0950   0.0619   0.3767   0.1357   0.3732   0.0887   0.2851   0.1216   0.3292
Chg/Max   20.63%   24.18%   -5.35%   8.55%    8.21%    9.37%    -12.69%  13.54%   5.00%    9.44%

Table 6. The rationale behind the CombMIN combination
method was to minimize the probability that a
non-relevant  document  would  be  highly  ranked,  while 
the  purpose  of  the  CombMAX  combination  method  was 
to  minimize  the  number  of  relevant  documents  being 
poorly  ranked.  There  is  an  inherent  flaw  with  both  of 
these  methods;  namely,  they  are  specialized  to  handle 
specific problems without regard to their effect on the
other  retrieved  documents:  for  example,  the  CombMIN 
combination  method  will  promote  the  type  of  error  that 
the  CombMAX  method  is  designed  to  minimize,  and 
vice  versa.  The  CombMED  combination  method  is  a 
simplistic  approach  to  handling  this,  using  the  median 
similarity  value  to  avoid  both  scenarios.  What  is  clearly 
needed  is  some  method  of  considering  the  documents' 
relative  ranks,  or  similarity  values,  instead  of  simply 
attempting  to  select  a  single  similarity  value  from  a  set 
of  runs.  To  this  end,  we  tried  three  other  methods  of 
combining retrieval methods: CombSUM, the summation
of the set of similarity values, or, equivalently for ranking purposes, the
numerical mean of the set of similarity values;
CombANZ, the average of the non-zero similarity
values, which ignores the effects of a single given run
or query failing to retrieve a relevant document; and
CombMNZ, which provides higher weights to documents
retrieved by multiple retrieval methods. Clearly, there are
more  possibilities  to  consider;  the  advantages  of  those 


chosen  are  simplicity,  in  terms  of  both  execution  effi- 
ciency and  implementation,  and  generality,  in  terms  of 
not  being  specific  to  a  given  method  or  retrieval  run. 
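
The six formulas of Table 6 can be sketched in a few lines. This is an illustration only, not the code used for the official runs; it assumes that each document has one similarity value per run and that a run which did not retrieve the document contributes a similarity of 0. The function name `combine` is hypothetical.

```python
from statistics import median


def combine(similarities, method="CombSUM"):
    """Combine one document's similarity values from several runs.

    similarities: list of floats, one per run (0 if the run did not
    retrieve the document, which is an assumption of this sketch).
    """
    nonzero = [s for s in similarities if s != 0]
    total = sum(similarities)
    if method == "CombMAX":
        return max(similarities)
    if method == "CombMIN":
        return min(similarities)
    if method == "CombSUM":
        return total
    if method == "CombANZ":   # average over the runs that retrieved the document
        return total / len(nonzero) if nonzero else 0.0
    if method == "CombMNZ":   # sum, boosted by the number of retrieving runs
        return total * len(nonzero)
    if method == "CombMED":
        return median(similarities)
    raise ValueError(f"unknown combination method: {method!r}")
```

For a document retrieved by every run, CombANZ and CombMNZ both reduce to a constant multiple of CombSUM, so the three methods rank such documents in the same order; they differ only on documents that some runs failed to retrieve.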

These  six  methods  were  evaluated  against  the  AP  and 
WSJ  test  collections  for  topics  51  through  100,  combin- 
ing the  similarity  values  of  each  of  the  five  individual 
runs  specified  above.  The  results  are  shown  in  Table  7 
below  the  results  of  each  of  the  corresponding  individ- 
ual runs  from  Table  5.  Note  that  while  the  CombMAX 
runs  performed  well  compared  with  most  of  the  indi- 
vidual runs,  they  did  not  do  as  well  as  the  best  of  the 
individual  runs  in  most  cases.  The  CombMIN  runs  per- 
formed similarly  for  the  AP  collection,  but  performed 
worse  than  every  individual  run  for  the  WSJ  collection. 

The  CombANZ  runs  and  the  CombMNZ  runs  both 
performed  better  than  the  best  of  the  individual  runs, 
with the CombMNZ runs performing only slightly better
than the CombANZ runs for three of the four collections,
and performing basically the same for the fourth.
The  primary  reason  for  the  similar  performance  of  the 
two  runs  is  that  the  two  methods  produce  the  same 
ranked sequence for all the documents retrieved by
all  five  individual  runs.  Thus,  the 

The  CombSUM  retrieval  run  was  performed  for  each 
of  the  nine  collections  on  the  two  training  CD-ROMs. 
The  results  are  shown  in  Table  5.  Breaking  this  anal- 
ysis down  to  a  per  topic  basis  in  Table  11,  it  can  be 
seen that the CombSUM method performs significantly
better than the best single individual run, Pn2.0; a
two-tailed paired t test on the CombSUM and Pn2.0 average
precisions results in a p value of 3.1e-05, which indicates
these results are conclusive. However, comparing
the CombSUM results with the best individual runs
on a per query basis results in a p value of
about 0.16, indicating that there is a 16 percent chance
that  the  CombSUM  method  is  no  better  than  the  best 
individual  run,  Pn2.0,  for  any  specific  query.  Perform- 
ing the  same  calculation  on  the  R-Precision  results  in 
similar  significance  findings. 
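
The significance test above pairs the two systems' scores topic by topic. A minimal sketch of the paired t statistic, using only the Python standard library, is given below. The function name is hypothetical. Converting the statistic to a two-tailed p value additionally requires the t distribution with n - 1 degrees of freedom, which this sketch does not compute.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic for per-topic scores of two systems.

    scores_a[i] and scores_b[i] must refer to the same topic.
    Returns mean(d) / (stdev(d) / sqrt(n)) for the per-topic differences d.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / sqrt(n))
```

In the setting described here, `scores_a` would hold the 50 per-topic average precisions of the CombSUM run and `scores_b` those of the comparison run.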

While  combining  all  five  runs  produced  an  overall 
improvement  in  retrieval  effectiveness  over  each  of  the 
runs,  the  same  does  not  always  hold  true  when  com- 
bining only  two  or  three  runs.  Each  of  the  ten  combi- 
nations of  two  CombSUM  runs  was  performed  for  both 
of  the  AP  test  collections,  as  well  as  a  run  combining 
all  three  of  the  P-norm  runs.  The  results  of  these  are 
given  in  Table  8.  Most  of  the  combinations  of  two  runs 
performed  worse  than  the  better  of  the  two  runs  while 
performing  better  than  the  poorer  of  the  two  runs.  One 
notable  exception  to  this  is  the  combination  of  the  two 
vector  runs,  which  performed  noticeably  poorer  than 
either  of  the  two  runs. 

3.4    Collection  Merging 

The  retrieval  results  for  each  of  the  collections  were 
combined  by  simply  merging  the  results  based  solely  on 
the  combined  similarity  values.  Since  the  retrieval  runs 
were  based  on  term  weights  without  collection  statis- 
tics such  as  inverse  document  frequency,  the  similarity 
values  were  directly  comparable  across  collections.  The 
results of merging the CombSUM results by summed
similarity value for both disks are shown in the last column
of Table 5.
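
Because the similarity values are comparable across collections, merging amounts to pooling the per-collection results and keeping the highest-scoring documents. The sketch below illustrates this under stated assumptions: the input format, the function name, and the default cutoff of 1000 documents are choices made for the example and are not taken from the original system.

```python
import heapq


def merge_collections(per_collection, k=1000):
    """Merge per-collection results into one ranked list.

    per_collection maps a collection name to a list of
    (doc_id, combined_similarity) pairs. Documents are ranked purely by
    similarity value, with no per-collection rescaling.
    """
    all_hits = (
        (sim, name, doc_id)
        for name, hits in per_collection.items()
        for doc_id, sim in hits
    )
    top = heapq.nlargest(k, all_hits, key=lambda hit: hit[0])
    return [(doc_id, name, sim) for sim, name, doc_id in top]
```

This simple pooling would not be appropriate for weighting schemes that use collection statistics such as inverse document frequency, since their similarity values would be on a different scale in each collection.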

4    TREC-2  Results 

The  procedure  described  above  was  used  for  both  our 
official  TREC-2  routing  and  ad-hoc  results.  The  exact 
queries  for  ad-hoc  topics  51  to  100  used  for  testing  our 
above  method  were  used  for  the  routing  queries  against 
the  new  collections  on  disk  3.  The  results  obtained 
from  performing  the  CombSUM  retrieval  runs  for  each 
of  the  four  collections  as  well  as  the  merged  results  are 
shown in Table 9. The two CombSUM entries in the
last column of the table are the official TREC-2 results.
Since  we  concentrated  on  the  ad-hoc  evaluations,  these 
routing  results  are  included  primarily  for  the  benefit 
of  other  groups,  for  purposes  of  comparison.  The  ad- 
hoc  queries  for  topics  101  to  150  were  evaluated  in  the 
same  manner,  and  are  reported  in  Table  10.  Again, 


the official results are the two CombSUM entries in the last
column of the table.

As  can  be  seen  from  Table  12,  the  CombSUM  method 
performs  quite  poorly  for  certain  topics  while  perform- 
ing very  well  for  others,  compared  to  the  best  single 
run's results for that topic. Comparing the CombSUM
results to the single best individual run (Pn2.0)
shows  an  improvement  for  46  out  of  the  50  topics,  which 
shows  that  the  CombSUM  run  performs  much  better 
than  any  single  individual  run.  Performing  a  two-tailed 
paired t test on the Pn2.0 and CombSUM precisions
results in a p value of about 1.1e-11, which indicates
these  results  are  very  conclusive.  However,  comparing 
the  CombSUM  results  with  the  best  individual  runs 
on  a  per  query  basis  results  in  a  p  value  of  about  0.2, 
indicating  that  there  is  a  20  percent  chance  that  the 
CombSUM  method  is  no  better  than  the  best  individ- 
ual run  for  each  specific  query.  Again,  performing  the 
same  calculation  on  the  R-Precision  results  in  similar 
values. 
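
The Chg/Max and Chg/Pn2.0 columns of the per-topic tables appear to be relative changes of the CombSUM score against, respectively, the best individual run for that topic and the Pn2.0 run. The sketch below reproduces both figures for topic 103 of Table 12 under that assumption; the function and variable names are illustrative.

```python
def pct_change(combined, baseline):
    """Relative change of the combined score over a baseline, in percent."""
    return 100.0 * (combined - baseline) / baseline


# Topic 103 from Table 12 (average precision of each individual run).
individual = {"SV": 0.2400, "LV": 0.2379, "Pn1.0": 0.0130,
              "Pn1.5": 0.0225, "Pn2.0": 0.0309}
combsum = 0.0532

chg_max = pct_change(combsum, max(individual.values()))   # vs. best run (SV)
chg_pn20 = pct_change(combsum, individual["Pn2.0"])       # vs. Pn2.0
```

For this topic the two figures have opposite signs: CombSUM improves considerably on Pn2.0 but falls well short of the short vector run, which illustrates why the comparison against the per-topic best run is much less favorable than the comparison against a single fixed run.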

4.1  The  CEO  Model 

The  Combination  of  Expert  Opinion  (CEO) 
model  [6,  7]  of  Thompson  can  be  used  to  treat  the  dif- 
ferent retrieval  methods  as  experts,  and  allows  combin- 
ing their  weighting  probability  distributions  to  improve 
performance.  This  could  be  used  in  a  variety  of  ways 
to  combine  results  from  a  variety  of  runs  and  indexing 
schemes  (that  could  include  stemming  and/or  morpho- 
logical analysis).  For  TREC-2,  the  CEO  experiments 
completed  consisted  of  combining  seven  individual  runs, 
the  three  P-norm  extended  boolean  retrieval  run  types 
described  above,  and  retrieval  runs  based  on  the  long 
vector  queries,  using  both  cosine  correlation  and  inner 
product  similarity  measures  for  SMART  system  term 
weighting  schemes  of  nnn  and  atn.  Further  discussion 
of  this  process  and  the  results  are  described  elsewhere 
in  these  proceedings. 

4.2  Evaluation 

Improving retrieval effectiveness by combining
evidence from multiple sources has been
attempted before in various incarnations, most recently
by Belkin et al. [1], who evaluated the progressive effect
of  considering  multiple  soft  boolean  representations  to 
improve  on  a  base  INQUERY  natural  language  retrieval 
run.  In  their  experiments,  the  base  INQUERY  natural 
language  run  performed  better  than  any  of  the  boolean 
representations,  and  they  report  that  combining  the  re- 
sults from  the  natural  language  representation  and  the 
combined  boolean  representations  with  equal  weights 
performed  worse  than  the  best  single  run.  Not  until 
weighting  the  natural  language  run  four  times  more 
Table 7: Comparison of combination runs and the five individual runs (Ad-hoc Topics 51-100).

          Average Precision                  R-Precision
Run       AP-1    WSJ-1   AP-2    WSJ-2      AP-1    WSJ-1   AP-2    WSJ-2
SV        0.2387  0.2203  0.2543  0.1503     0.2624  0.2616  0.2649  0.1744
LV        0.2435  0.2414  0.2664  0.1633     0.2672  0.2800  0.2704  0.1860
Pn1.0     0.2810  0.2941  0.3004  0.2206     0.2688  0.3221  0.3165  0.2367
Pn1.5     0.3122  0.3199  0.3332  0.2327     0.2976  0.3443  0.3412  0.2511
Pn2.0     0.3027  0.3217  0.3300  0.2325     0.2968  0.3470  0.3339  0.2442
CombMAX   0.2856  0.3205  0.3337  0.2343     0.3013  0.3484  0.3431  0.2449
CombMIN   0.2863  0.1924  0.3047  0.1308     0.3036  0.2214  0.2980  0.1395
CombSUM   0.3493  0.3605  0.3748  0.2752     0.3590  0.3767  0.3732  0.2851
CombANZ   0.3493  0.3367  0.3748  0.2465     0.3590  0.3517  0.3732  0.2590
CombMNZ   0.3059  0.3368  0.3516  0.2467     0.3175  0.3517  0.3578  0.2590
CombMED   0.2943  0.3204  0.3335  0.2328     0.2977  0.3444  0.3414  0.2518


than  the  combined  boolean  schemes  did  they  experi- 
ence improved  retrieval  performance  when  combining 
different  query  methods.  This  differs  from  our  results 
in  several  ways.  Most  importantly,  the  stage  at  which 
we  combine  the  different  methods  differed:  Belkin  et  al. 
combined  the  query  representations  before  performing 
the  actual  retrieval,  while  we  combined  the  similarity 
values  produced  from  retrieval  on  each  method  individ- 
ually. The  difference  between  the  two  methodologies 
can  best  be  demonstrated  using  the  standard  vector 
space  model:  Belkin  et  al.  combined  by  summing  the 
vector  representations  of  each  query,  while  our  method 
is  analogous  to  summing  the  cosines  of  the  angles  be- 
tween each  vector  and  a  document.  It  is  easily  shown 
that  the  cosine  of  the  angle  between  a  document  vec- 
tor and  a  combined  query  vector,  that  is  the  sum  of 
two  query  vectors  as  in  the  Belkin  et  al.  approach,  is 
not  equal  to  the  sum  of  the  cosines  between  a  docu- 
ment vector  and  the  two  separate  query  vectors.  Other 
differences  between  the  two  methodologies  include  the 
fact  that  our  P-norm  queries  performed  better  on  av- 
erage than  our  natural  language  vector  queries,  with 
exceptions  on  a  per  query  basis.  We  used  only  one  P- 
norm  query  and  modified  the  operator  weights  while 
Belkin et al. used five different boolean queries. Finally,
combining five runs with equal weights actually
improved performance over each individual run.
However,  one  common  trend  emerges  from  both  exper- 
iments: the  more  query  representations  considered,  the 
better  the  results. 
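
The difference between summing query vectors and summing similarity values can be checked with a two-dimensional example. The vectors below are made up purely for illustration.

```python
from math import sqrt


def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


doc = [1.0, 0.0]
q1 = [1.0, 0.0]    # identical in direction to the document
q2 = [0.0, 1.0]    # orthogonal to the document

# Combine the query representations first, then retrieve once.
summed_query = [a + b for a, b in zip(q1, q2)]
cos_of_sum = cosine(doc, summed_query)

# Retrieve with each query separately, then sum the similarities (CombSUM).
sum_of_cos = cosine(doc, q1) + cosine(doc, q2)
```

Here `cos_of_sum` equals 1/sqrt(2), while `sum_of_cos` equals 1.0, so the two ways of combining the same two queries assign the document different scores.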

4.3    Future  Exploration 

Planned  future  work  includes  studying  the  following: 

•  Individually  weighting  various  methods'  similarity 
values  when  performing  combination  runs. 


•  Normalization  methods  to  allow  combination  of 
runs  made  with  different  weighting  schemes. 

•  Extending  the  analysis  to  all  combinations  of  three 
and  four  retrieval  runs. 

•  Considering  more/different  query  types. 

[Table 8 (CombSUM combinations of two or three individual runs): only fragments of this table survive in this copy. "..." marks a point where text was removed from the source.]

...  0.2664  ...  0.2704  0.1860
Pn1.0  ...  0.2518  ...  0.3413  ...  0.2442  0.3181  0.3596  0.3624  0.2568
LV and Pn2.0  ...  0.3408  0.3518  0.2458  0.3141  0.3608  0.3596  0.2615
Pn1.0 and Pn1.5  0.2817  0.3109  ...  0.2324  0.2898  0.3412  0.3395  0.2476
Pn1.0 and Pn2.0  ...  0.3330  0.2367  0.2944  ...  0.3397  0.2507
...  0.2328

Table 9: Average Precision and Exact R-Precision for the five individual runs compared with the combined CombSUM
runs (Routing Topics 51-100).

Average non-interpolated Precision

Run       AP-3     PATN-3   SJM-3    ZF-3     Disk 3
SV        0.1347   0.0189   0.1139   0.0593   0.0589
LV        0.1189   0.0156   0.1056   0.0587   0.0494
Pn1.0     0.2519   0.0257   0.2128   0.1141   0.2039
Pn1.5     0.2869   0.0239   0.2411   0.1189   0.2279
Pn2.0     0.2852   0.0221   0.2390   0.1303   0.2225
CombSUM   0.3196   0.0260   0.2696   0.1304   0.2681
Chg/Max   11.4%    1.2%     11.8%    0.07%    17.6%

Exact R-Precision

Run       AP-3     PATN-3   SJM-3    ZF-3     Disk 3
SV        0.1703   0.0171   0.1337   0.0595   0.0595
LV        0.1444   0.0156   0.1098   0.0547   0.1002
Pn1.0     0.2790   0.0325   0.2185   0.1224   0.2594
Pn1.5     0.3082   0.0322   0.2579   0.1248   0.2786
Pn2.0     0.3062   0.0310   0.2531   0.1462   0.2809
CombSUM   0.3319   0.0319   0.2900   0.1260   0.3143
Chg/Max   7.7%     -1.8%    12.4%    -13.8%   11.9%

Table 10: Average Precision and Exact R-Precision for the five individual runs (Ad-hoc Topics 101-150).

Average non-interpolated Precision

          Disk 1                                       Disk 2                              Both
Run       AP       DOE      FR       WSJ      ZF       AP       FR       WSJ      ZF       Disks
SV        0.3237   0.0949   0.0630   0.2740   0.0936   0.3068   0.0650   0.2259   0.1166   0.2035
LV        0.3326   0.0697   0.1018   0.2848   0.0997   0.2981   0.0602   0.2483   0.1045   0.2159
Pn1.0     0.3340   0.0831   0.1777   0.3153   0.1292   0.3133   0.1927   0.2838   0.1722   0.2205
Pn1.5     0.3682   0.0814   0.1874   0.3332   0.1430   0.3438   0.1982   0.2941   0.1964   0.2543
Pn2.0     0.3647   0.0750   0.1761   0.3290   0.1307   0.3419   0.1995   0.2828   0.2018   0.2573
CombSUM   0.4153   0.1038   0.2133   0.3778   0.1657   0.3959   0.2000   0.3561   0.2200   0.3206
Chg/Max   12.8%    9.4%     13.8%    13.4%    15.9%    15.1%    0.2%     21.0%    9%       24.6%

Exact R-Precision

          Disk 1                                       Disk 2                              Both
Run       AP       DOE      FR       WSJ      ZF       AP       FR       WSJ      ZF       Disks
SV        0.3351   0.0947   0.0567   0.2881   0.1086   0.3052   0.0718   0.2502   0.1130   0.2649
LV        0.3385   0.0714   0.1006   0.3022   0.1055   0.3019   0.0540   0.2712   0.1036   0.2661
Pn1.0     0.3495   0.0989   0.1434   0.3322   0.1190   0.3139   0.1653   0.2934   0.1577   0.2867
Pn1.5     0.3721   0.0899   0.1582   0.3440   0.1277   0.3465   0.1716   0.3044   0.1796   0.3203
Pn2.0     0.3735   0.0925   0.1579   0.3389   0.1154   0.3401   0.1640   0.2953   0.1979   0.3233
CombSUM   0.4137   0.1156   0.1938   0.3757   0.1389   0.3712   0.1741   0.3385   0.1909   0.3711
Chg/Max   10.8%    16.9%    22.5%    9.2%     8.8%     7.1%     1.4%     11.2%    -3.5%    14.8%
Table 11: Per-Topic comparison of Average Precisions (Ad-hoc Topics 51-100).

Topic  #Rel   SV      LV      Pn1.0   Pn1.5   Pn2.0   CombSUM  Chg/Max  Chg/Pn2.0
74     273    0.0000  0.0099  0.0001  0.0002  0.0002  0.0002   -97.98%  0.00%
90     206    0.1248  0.1134  0.0082  0.0036  0.0015  0.0089   -92.87%  493.33%
77     138    0.1100  0.2241  0.0093  0.0196  0.0246  0.0411   -81.66%  67.07%
91     40     0.1648  0.1039  0.0012  0.0038  0.0068  0.0308   -81.31%  352.94%
72     119    0.1099  0.1173  0.0076  0.0116  0.0138  0.0323   -72.46%  134.06%
73     183    0.0231  0.0124  0.0017  0.0052  0.0120  0.0098   -57.58%  -18.33%
94     310    0.0176  0.0185  0.0191  0.1094  0.2100  0.1043   -50.33%  -50.33%
85     894    0.1587  0.1779  0.0239  0.0433  0.0518  0.1018   -42.78%  96.53%
67     92     0.0011  0.0012  0.0435  0.0487  0.0502  0.0290   -42.23%  -42.23%
55     810    0.2362  0.4535  0.1153  0.1266  0.1256  0.2874   -36.63%  128.82%
64     375    0.1493  0.1641  0.0740  0.0673  0.0706  0.1097   -33.15%  55.38%
57     461    0.0282  0.0605  0.5464  0.4518  0.3447  0.3786   -30.71%  9.83%
80     374    0.0218  0.0266  0.1007  0.1125  0.1025  0.0899   -20.09%  -12.29%
82     602    0.2544  0.3139  0.2603  0.1746  0.1278  0.2521   -19.69%  97.26%
97     352    0.0513  0.0513  0.1133  0.1376  0.1405  0.1190   -15.30%  -15.30%
76     294    0.1481  0.1677  0.1221  0.0966  0.0755  0.1440   -14.13%  90.73%
89     174    0.0365  0.0762  0.0481  0.1159  0.1515  0.1306   -13.80%  -13.80%
98     666    0.0822  0.0677  0.0421  0.0579  0.0703  0.0711   -13.50%  1.14%
84     396    0.1659  0.0985  0.0787  0.1251  0.1559  0.1471   -11.33%  -5.64%
99     288    0.4834  0.4796  0.3529  0.3551  0.3224  0.4480   -7.32%   38.96%
58     159    0.0971  0.0620  0.3459  0.4198  0.4072  0.3937   -6.22%   -3.32%
92     88     0.0253  0.0405  0.0130  0.0154  0.0176  0.0381   -5.93%   116.48%
71     380    0.0595  0.0613  0.1396  0.1978  0.2210  0.2156   -2.44%   -2.44%
88     165    0.1813  0.2341  0.3629  0.3635  0.3542  0.3550   -2.34%   0.23%
59     579    0.2368  0.2272  0.1546  0.3827  0.4370  0.4282   -2.01%   -2.01%
93     171    0.0344  0.0385  0.5233  0.4831  0.3779  0.5141   -1.76%   36.04%
95     263    0.0149  0.0228  0.1450  0.2072  0.2392  0.2375   -0.71%   -0.71%
52     535    0.4918  0.4462  0.4896  0.6623  0.6882  0.6864   -0.26%   -0.26%
78     162    0.2202  0.2632  0.7696  0.7484  0.7250  0.7705   0.12%    6.28%
61     206    0.4106  0.5130  0.3646  0.3786  0.3621  0.5167   0.72%    42.70%
51     138    0.2799  0.3957  0.4751  0.5282  0.5428  0.5476   0.88%    0.88%
70     55     0.1156  0.1198  0.7700  0.7919  0.7982  0.8080   1.23%    1.23%
54     171    0.3063  0.3045  0.3950  0.3207  0.2673  0.4048   2.48%    51.44%
62     298    0.1674  0.1907  0.1519  0.2087  0.2121  0.2198   3.63%    3.63%
81     62     0.1759  0.1410  0.2287  0.2370  0.2321  0.2467   4.09%    6.29%
68     195    0.0960  0.0920  0.1040  0.2098  0.2540  0.2651   4.37%    4.37%
69     52     0.0956  0.1629  0.5382  0.5873  0.5833  0.6227   6.03%    6.75%
53     571    0.1821  0.1874  0.1543  0.2928  0.3241  0.3461   6.79%    6.79%
83     633    0.2673  0.2931  0.1753  0.2412  0.2666  0.3317   13.17%   24.42%
56     878    0.3011  0.3277  0.3391  0.3089  0.2691  0.3955   16.63%   46.97%
100    317    0.1904  0.1972  0.1423  0.1815  0.2094  0.2516   20.15%   20.15%
65     386    0.0100  0.0078  0.1111  0.1190  0.1236  0.1529   23.71%   23.71%
86     213    0.5146  0.4242  0.5216  0.5234  0.4891  0.6624   26.56%   35.43%
60     60     0.0547  0.0887  0.0866  0.0960  0.0992  0.1259   26.92%   26.92%
79     232    0.1376  ------  0.2057  0.2784  0.2719  0.3690   32.54%   35.71%
75     ---    ------  ------  ------  ------  ------  ------   ------   ------
63     208    0.0032  0.0028  0.0631  0.0597  0.0688  0.0972   41.28%   41.28%
66     197    0.0001  0.0000  0.0320  0.0388  0.0478  0.0695   45.40%   45.40%
87     188    0.0436  0.0559  0.0410  0.0914  0.1256  0.1867   48.65%   48.65%
96     693    0.0095  0.0084  0.0833  0.1177  0.1302  0.2356   80.95%   80.95%
Avg    15667  0.1418  0.1555  0.1988  0.2242  0.225   0.262    16.44%   16.44%

(------ marks cells that are illegible in the source scan. The topic number 75 is inferred as the only topic from 51-100 not otherwise listed.)



Table 12: Per-Topic comparison of Average Precisions (Ad-hoc Topics 101-150).

Topic  #Rel   SV      LV      Pn1.0   Pn1.5   Pn2.0   CombSUM  Chg/Max  Chg/Pn2.0
103    94     0.2400  0.2379  0.0130  0.0225  0.0309  0.0532   -77.83%  72.17%
122    114    0.1467  0.1059  0.0547  0.0490  0.0544  0.0898   -38.79%  65.07%
101    57     0.2232  0.1932  0.0900  0.1088  0.1175  0.1482   -33.60%  26.13%
140    25     0.0150  0.0520  0.0111  0.0207  0.0315  0.0356   -31.54%  13.02%
135    400    0.4072  0.4449  0.1103  0.1121  0.0835  0.3122   -29.83%  273.89%
104    75     0.1867  0.2214  0.0715  0.0642  0.0828  0.1671   -24.53%  101.81%
121    55     0.0017  0.0028  0.0283  0.0542  0.0628  0.0475   -24.36%  -24.36%
127    223    0.2411  0.2476  0.0543  0.0942  0.1028  0.1911   -22.82%  85.89%
143    397    0.4069  0.3978  0.1628  0.1968  0.2022  0.3257   -19.96%  61.08%
124    173    0.0201  0.0165  0.1711  0.1360  0.1068  0.1501   -12.27%  40.54%
146    358    0.6795  0.7017  0.4394  0.4775  0.4907  0.6255   -10.86%  27.47%
114    138    0.0518  0.0756  0.2381  0.2559  0.2615  0.2525   -3.44%   -3.44%
112    291    0.0041  0.0188  0.3876  0.3515  0.3194  0.3779   -2.50%   18.32%
130    286    0.2301  0.2937  0.4184  0.5477  0.5749  0.5632   -2.04%   -2.04%
102    64     0.2127  0.2308  0.1119  0.1373  0.1505  0.2287   -0.91%   51.96%
128    381    0.0915  0.0751  0.2583  0.3710  0.4190  0.4156   -0.81%   -0.81%
150    458    0.4618  0.4994  0.3362  0.4432  0.4566  0.4985   -0.18%   9.18%
115    165    0.4053  0.4383  0.3262  0.3633  0.3720  0.4407   0.55%    18.47%
113    206    0.0531  0.0768  0.2700  0.3304  0.3030  0.3367   1.91%    11.12%
119    326    0.0644  0.0843  0.2442  0.2252  0.2018  0.2497   2.25%    23.74%
107    98     0.1101  0.1580  0.3156  0.4062  0.4451  0.4592   3.17%    3.17%
116    49     0.2635  0.1886  0.3008  0.2803  0.2487  0.3125   3.89%    25.65%
118    273    0.0259  0.0164  0.1551  0.1736  0.1725  0.1806   4.03%    4.70%
145    162    0.3218  0.2983  0.2169  0.2180  0.1961  0.3359   4.38%    71.29%
138    52     0.0895  0.0992  0.1169  0.1321  0.1322  0.1386   4.84%    4.84%
148    250    0.7256  0.7300  0.7316  0.7904  0.8010  0.8416   5.07%    5.07%
108    294    0.0994  0.1266  0.3089  0.2586  0.1820  0.3264   5.67%    79.34%
136    206    0.1713  0.1976  0.3943  0.5899  0.5894  0.6319   7.12%    7.21%
134    188    0.4033  0.4132  0.5091  0.4897  0.4675  0.5500   8.03%    17.65%
132    201    0.1979  0.3246  0.5671  0.5759  0.5625  0.6222   8.04%    10.61%
133    80     0.2396  0.2793  0.1541  0.2322  0.2425  0.3034   8.63%    25.11%
147    315    0.1371  0.1266  0.2312  0.2820  0.3001  0.3317   10.53%   10.53%
106    201    0.0354  0.0458  0.2784  0.3463  0.3653  0.4039   10.57%   10.57%
126    240    0.2766  0.2990  0.1997  0.3132  0.3548  0.3971   11.92%   11.92%
125    169    0.1632  0.1719  0.2580  0.2224  0.1842  0.2920   13.18%   58.52%
137    158    0.3365  0.4135  0.2339  0.3128  0.3555  0.4715   14.03%   32.63%
131    28     0.0659  0.0648  0.0671  0.0984  0.1061  0.1255   18.28%   18.28%
110    496    0.4909  0.4968  0.4845  0.4768  0.4336  0.6001   20.79%   38.40%
142    660    0.4269  0.4479  0.4060  0.4694  0.4629  0.5721   21.88%   23.59%
117    275    0.1165  0.1030  0.1296  0.1596  0.1637  0.2007   22.60%   22.60%
149    133    0.0309  0.0836  0.0490  0.0732  0.0769  0.1028   22.97%   33.68%
129    207    0.3391  0.2338  0.2402  0.3348  0.3477  0.4369   25.65%   25.65%
144    49     0.2490  0.2275  0.0832  0.1327  0.1958  0.3130   25.70%   59.86%
105    54     0.0656  0.1744  0.1304  0.1439  0.1441  0.2262   29.70%   56.97%
139    55     0.0911  0.1139  0.0592  0.0705  0.0776  0.1481   30.03%   90.85%
111    285    0.3795  0.3540  0.1991  0.3025  0.3756  0.5025   32.41%   33.79%
123    435    0.0697  0.0800  0.2254  0.2252  0.2043  0.3014   33.72%   47.53%
120    83     0.0165  0.0136  0.0490  0.0556  0.0538  0.0746   34.17%   38.66%
109    742    0.0875  0.0879  0.0811  0.1344  0.1500  0.2290   52.67%   52.67%
141    36     0.0084  0.0125  0.0538  0.0517  0.0514  0.0873   62.27%   69.84%
Avg    10760  0.2035  0.2159  0.2205  0.2543  0.2573  0.3206   24.60%   24.60%

Machine Learning for Knowledge-Based Document Routing
(A  Report  on  the  TREC-2  Experiment) 


Richard  M.  Tong,  Lee  A.  Appelbaum 

Advanced  Decision  Systems 
(a division of Booz·Allen & Hamilton, Inc.)
1500  Plymouth  Street,  Mountain  View,  CA  94043 


1  Introduction 

This  paper  contains  a  description  of  the  experi- 
ments performed  by  Advanced  Decision  Systems  as  part  of 
the  Second  Text  Retrieval  Conference  (TREC-2).^ 

The overall system we have developed for TREC-2
demonstrates how we can combine statistically-oriented
machine learning techniques with a commercially available
knowledge-based information retrieval system. As in TREC-1,
the tool we use for the fully automatic construction of
routing queries is based on the Classification and Regression
Trees (CART) algorithm.^2 However, in a departure from
TREC-1, we have expanded our definition of what constitutes
a document "feature" within the CART algorithm, and have
also explored how the CART output can be used as the basis
of topic definitions that can be interpreted by the TOPIC®
retrieval system developed by Verity, Inc. of Mountain View,
CA.

Section 2 of the paper contains a review of the
CART algorithm itself and the data structures it produces.
Section 3 describes the two basic algorithms we have devised
for converting CART output into TOPIC-readable files.
Section 4 contains a description of our experimental
procedure and an analysis of the official results, as well as
data from a series of auxiliary experiments. Section 5
contains some general comments on overall performance. We
conclude in Section 6 with a brief discussion of possible
future research directions.


1. Requests for further information about the TREC-2 experiments
should be directed to the authors at the address above, or
electronically to either rtong@ads.com or lee@ads.com.

2. A comprehensive discussion of the CART algorithm can be
found in [1]. Details of previous work on the use of CART for
information retrieval are presented in [2] and in [3].


2  The  CART  Algorithm 

CART has been shown to be useful when one has
access to datasets describing known classes of observations,
and wishes to obtain rules for classifying future observations
of unknown class, exactly as in the document routing problem.
CART is particularly attractive when the dataset is
"messy" (i.e., is noisy and has many missing values) and
thus unsuitable for use with more traditional classification
techniques. In addition, and particularly important for the
document routing application, if it is important to be able to
specify both the misclassification costs and the prior
probabilities of class membership, then CART has a direct way
of incorporating such information into the tree building
process. Finally, CART can generate auxiliary information, such
as the expected misclassification rate for the classifier as a
whole and for each terminal node in the tree, that is useful
for the document routing problem.

2.1  CART  Processing  Flow 

Figure 1 shows how the CART algorithm is used to
construct the optimal classification tree based on the training
data provided to it. The diagram shows the four sub-processes
used to generate the optimal tree, T*. The "raw" training
data (i.e., the original texts of the articles), together with
the class specifications (i.e., the training data relevance
judgments) and the feature specifications (i.e., the words
defined to be features), are input to the Feature Extraction
Module. The output is a set of vectors that record the class
membership and the features contained in the training data.
These vectors, together with class priors and the cost function
(these are optional), are input to the Tree Growing Module,
which then constructs the maximal tree (T_max) that
characterizes the training data. Since this tree overfits the
data, the next step is to construct a series of nested sub-trees
by pruning T_max to the root. The sequence of sub-trees
(T_max > ... > T_n) is input to the Tree Selection Module,
which performs a cross-validation analysis on the sequence and
selects the tree with the lowest cross-validation error.^3 This
is T*.

Figure 1: CART Processing Flow
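
The tree selection step can be illustrated with a short Python sketch. This is only a sketch under stated assumptions, not the CART implementation: the `tolerance` parameter is a hypothetical stand-in for CART's judgement that two cross-validation error rates are statistically indistinguishable, and the candidate values are invented for illustration.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    k: int        # position of the pruned sub-tree in the nested sequence
    size: int     # number of terminal nodes, |T|
    r_cv: float   # cross-validation error rate, Rcv(T)


def select_tree(candidates, tolerance=0.0):
    """Pick the smallest tree whose Rcv is within `tolerance` of the minimum.

    `tolerance` is an assumed simplification of the statistical test.
    """
    best_error = min(c.r_cv for c in candidates)
    close_enough = [c for c in candidates if c.r_cv <= best_error + tolerance]
    return min(close_enough, key=lambda c: (c.size, c.r_cv))


# Invented values, for illustration only.
sequence = [
    Candidate(k=1, size=30, r_cv=0.31),
    Candidate(k=2, size=12, r_cv=0.24),
    Candidate(k=3, size=5, r_cv=0.25),
    Candidate(k=4, size=1, r_cv=0.50),
]
chosen = select_tree(sequence, tolerance=0.02)
```

With a tolerance of zero the rule reduces to picking the tree with the lowest cross-validation error; a positive tolerance trades a small amount of estimated error for a smaller tree.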

In our TREC-2 system, we take T* and convert it
into a TOPIC outline file as described in Section 3. That is,
rather than use CART itself to perform the routing on the
unseen documents (as we did in TREC-1), we use the
CART trees as skeletons for TOPIC concepts, and then have
TOPIC do the routing. An advantage of this is that we can
make use of TOPIC's extensively optimized text database
capabilities, thus allowing us to easily generate the output
files needed for the official scoring program.

2.2  Data  Structures  Generated  by  CART 

To illustrate the processing that CART performs,
we will use an example taken from the TREC-2 corpus. An
example query is as follows:

<top>

<head> Tipster Topic Description
<num> Number: 097

<dom> Domain: Science and Technology
<title> Topic: Fiber Optics Applications

<desc> Description:

Document must identify instances of fiber
optics technology actually in use.

<narr> Narrative:

To be relevant, a document must describe
actual operational situations in which
fiber optics are being employed, or will
be employed. A document describing future
fiber optics use will be relevant only if
contracts have been signed concerning the
future application.

<con> Concept(s):

1. fiber optic, light

2. telephone, LAN, television

<fac> Factor(s):
<def> Definition(s):

1. Fiber optics refers to technology in
which information is passed via laser
light transmitted through glass or plastic
fibers.
</top>

This is a very comprehensive statement of information
need and provides a rich set of features that we can
use for CART.^4 Our basic procedure is to extract from the
information need statement all the unique content words
and then stem them, which gives the following list:

ACT APPL CONCERN CONTRACT DESCRIB DOCU
EMPLOY FIB FUT GLASS INFORM LAN LAS LIGHT
OPER OPT PAS PLAST REF RELEV SIGN SITU
TECHNOLOG TELEPHON TELEV TRANSMIT VIA^5


3. The algorithm actually minimizes with respect to both the
cross-validation error and the tree complexity, so that if two
trees have statistically indistinguishable error rates, the
smaller of the two trees will be selected as optimal.


4. In general, determining what set of features to use for CART is
a matter of experience and judgement. For the TREC-2 corpus we
have made use of the information need statements, but other
approaches, such as using all the unique words in the training set,
are equally valid. In fact, it is this freedom of choice of features
that gives CART a great deal of its flexibility.
Given this list of features we can generate the
CART training vectors by testing all the documents in the
training set for the presence or absence of each feature.^6 The
resulting vectors, together with the ground truth
classification, are the input to CART. The output from CART
contains information about the sequence of nested decision
trees and a specification of the optimal tree.
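
As an illustration of the presence/absence test, the following Python sketch builds binary training vectors. It is a simplified sketch, not the authors' extraction code: it matches exact upper-cased words (no stemming), the tokenization rule is an assumption, and the two documents and their identifiers are invented.

```python
import re


def presence_vector(text, features):
    """Return a 0/1 vector marking which feature words occur in the text."""
    tokens = set(re.findall(r"[A-Z]+", text.upper()))
    return [1 if feature in tokens else 0 for feature in features]


def training_vectors(documents, relevant_ids, features):
    """Pair each document's feature vector with its ground-truth class."""
    return [
        (presence_vector(text, features), 1 if doc_id in relevant_ids else 0)
        for doc_id, text in documents.items()
    ]


features = ["CONTRACT", "LIGHT", "SIGN", "VIA"]
documents = {  # invented example documents
    "d1": "The contract covers a fiber link via satellite.",
    "d2": "Light rain is expected.",
}
vectors = training_vectors(documents, relevant_ids={"d1"}, features=features)
```

Each vector has one position per feature, so a test such as CONTRACT<=0.50 simply asks whether the corresponding position is 0.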

For our example, the maximal tree is shown in
Figure 2. This tree is to be read from left to right. Thus the
first test is on the presence of the stem CONTRACT. If the
stem is present (i.e., if the test CONTRACT<=0.5 fails)
then the tree branches downwards (this is leftward in CART
jargon); if the stem is not present (i.e., the test succeeds)
then the tree branches upwards (rightward).

class 0 (k=
CONCERN<=0.50 (
class
TRANSMIT<=
class

GLASS<=0.50 (k=5,ac=
class
TRANSMIT<=
class

CONCERN<=0.50 (
class 0 (k
VIA<=0.50 (k=5,ac=0,Rt=0.

class 1 (k=4,Rt
TRANSMIT<=0.50 (k=5,
class 0 (k=4,Rt
SIGN<=0.50 (k=5,ac=0,Rt=0.3454
class 1 (k=5,Rt=0.003917)
LIGHT<=0.50 (k=6,ac=0,Rt=0.376884)
class 0 (k=3,Rt=0.000000)

Figure 2: Maximal Tree

Associated with each decision node is information
that tells:

•  the level in the tree at which the test occurs (the k
value): in all cases k ranges from k=1, which corresponds
to the maximal tree, to k=k_max, which corresponds to
the null tree, where k_max is dependent on the
actual tree grown by CART;

•  the class to be assigned to this node if the tree were
pruned here (the ac value): in the present case
ac=1 for the class that corresponds to a relevant
document and ac=0 for the class that corresponds to a
non-relevant document; and

•  the error rate of the node (the Rt value): this
indicates the resubstitution error rate for the specific
node.

The terminal nodes in the tree have similar
information associated with them. Notice though that terminal
nodes, by definition, specify to which class the node
corresponds.

5. We only use the <narr>, <con> and <def> fields, and the stop
word list is taken from [4]. The stemming algorithm is taken from
[5].

6. In our TREC-1 experiments we also made use of the frequency
of occurrence of the features in the documents. Since our ultimate
goal here is to generate TOPIC trees we restrict ourselves to just
testing for the presence or absence of the feature; this allows us
to perform a straightforward conversion between CART and
TOPIC.
CART also generates the following table of error
estimates that are used in selecting the optimal tree. Here k
is the k-value, |T| is the size of the tree (measured in terms
of the number of terminals), R(T) is the resubstitution rate
for the overall tree, and Rcv(T) is the cross-validation rate
for the overall tree. Recall that CART minimizes with
respect to both size and cross-validation rate, so that for our
example, the optimal tree is when k=4.

k    |T|    R(T)    Rcv(T)

So, pruning the maximal tree so that all nodes have k>4
gives us the optimal tree shown in Figure 3.

            class 0 (k=5,Rt=0.261725)
         VIA<=0.50 (k=5,ac=0,Rt=0.303601)
            class 1 (k=5,Rt=0.007834)
      SIGN<=0.50 (k=5,ac=0,Rt=0.345477)
         class 1 (k=5,Rt=0.003917)
   LIGHT<=0.50 (k=6,ac=0,Rt=0.376884)
      class 0 (k=5,Rt=0.031407)
CONTRACT<=0.50 (k=7,ac=1,Rt=0.497487)
   class 1 (k=6,Rt=0.027421)

Figure 3: Optimal Tree Using Stemmed Features

Notice that by pruning away the deeper nodes in the tree we
are left with a tree that tests on just five of the original 26
stems, and has three paths that lead to class-1 nodes (i.e.,
a decision that the document is relevant).

The unstemmed version of our procedure is identical
except that we do no stemming of the unique words
extracted from the information need statement. For the
example, this produces the following list of features:

ACTUAL APPLICATION CONCERNING CONTRACTS
DESCRIBE DESCRIBING DOCUMENT EMPLOYED
FIBER FIBERS FUTURE GLASS INFORMATION LAN
LASER LIGHT OPERATIONAL OPTIC OPTICS
PASSED PLASTIC REFERS RELEVANT SIGNED
SITUATIONS TECHNOLOGY TELEPHONE TELEVISION
TRANSMITTED VIA

and the following error table:

k    |T|    R(T)      Rcv(T)
1    36     0.0932    0.3168
2    33     0.0972    0.2763

so that the optimal tree is as shown in Figure 4.

class 0 (k=10,Rt=0.062814)
TRANSMITTED<=0.50 (k=9,ac=0,Rt=0.010469)
ACTUAL<=0.50 (k=9,ac=0,Rt=0.031407)
class 1 (k=9,Rt=0.000000)

Figure 4: Optimal Tree Using Unstemmed Features


3  Transforming CART Trees into TOPIC
Outline Specifications

The trees produced by CART are not directly
usable by TOPIC because they represent the information we
need (i.e., the decision function and the associated decision
variables) in a form that is incompatible with the TOPIC
knowledge representation. The main problem then is to
define a transformation from the CART representation to the
TOPIC representation that at least preserves the decision
information and perhaps augments it so that we get
improved routing performance.

In this section we explore two possible strategies for
constructing these transformations. The first is a strict re-coding
of the information in the CART tree. The second generalizes
the intent of the CART tree but adds no new information.
Both of these techniques are "automatic," in the sense that,
once various parameters have been chosen, the algorithms
work without human intervention.

3.1  First Canonical Form

The first canonical form completely preserves the
decision function generated by CART and involves a simple
mapping of the CART tree into TOPIC outline file format.^7
To do this, we begin by observing that each path in the
CART tree from the root to a class-1 leaf node constitutes
a conjunction of tests of decision variables. Since we
have constrained the tests to be of the form "Is word X
present or not?" we can easily model this conjunction in
TOPIC using AND and NOT operators. Since there will
generally be multiple paths to class-1 leaf nodes, the
algorithm combines the separate paths in the TOPIC definition
using the OR operator.

To illustrate, the optimal tree for our example
contains the following conjunctive descriptions of class-1 leaf
nodes:

1. CONTRACT

2. SIGN and not LIGHT and not CONTRACT

3. VIA and not SIGN and not LIGHT and not
CONTRACT

which  leads  directly  to  a  TOPIC  outline  of  the  form: 

Topic_097 <Or>

* 0.77 TopicStyle_097_1 <Or>

** 0.99 TopicPath_097_1-1 <And>
*** ~ 'CONTRACT'
*** ~ 'LIGHT'
*** ~ 'SIGN'
*** 'VIA'

** 1.00 TopicPath_097_1-2 <And>
*** ~ 'CONTRACT'
*** ~ 'LIGHT'
*** 'SIGN'

** 0.97 TopicPath_097_1-3 <And>
*** 'CONTRACT'

Here we use a notation for topic names that ensures
uniqueness. Notice that in this model we use the
resubstitution rates (actually 1-Rt) to define a weight for the
individual conjuncts and the overall cross-validation rate
(actually 1-Rcv) as the weight for the disjuncts. Note also
that since TOPIC only uses weights defined to two decimal
places we have rounded the weights derived from the CART tree.
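
The first canonical form can be sketched in Python as a walk over the tree that collects every path ending in a class-1 leaf. This is an illustrative sketch, not the authors' converter: the `Leaf` and `Split` classes are hypothetical, the outline text merely imitates the listing above rather than a verified TOPIC file syntax, and the value `r_cv=0.23` is an assumption back-derived from the 0.77 weight shown in the example.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    label: int   # 1 = relevant, 0 = non-relevant
    rt: float    # resubstitution error rate of the node


@dataclass
class Split:
    feature: str
    absent: Union["Split", Leaf]    # branch taken when the word is absent
    present: Union["Split", Leaf]   # branch taken when the word is present


def class1_paths(node, path=()):
    """Yield (conjuncts, rt) for every root-to-leaf path ending in class 1."""
    if isinstance(node, Leaf):
        if node.label == 1:
            yield list(path), node.rt
        return
    yield from class1_paths(node.absent, path + ((node.feature, False),))
    yield from class1_paths(node.present, path + ((node.feature, True),))


def to_outline(topic, tree, r_cv):
    """Render the paths as an OR of ANDs, weighted by 1-Rcv and 1-Rt."""
    lines = [f"Topic_{topic} <Or>", f"* {1 - r_cv:.2f} TopicStyle_{topic}_1 <Or>"]
    for n, (conjuncts, rt) in enumerate(class1_paths(tree), start=1):
        lines.append(f"** {1 - rt:.2f} TopicPath_{topic}_1-{n} <And>")
        for feature, present in conjuncts:
            lines.append(f"*** '{feature}'" if present else f"*** ~ '{feature}'")
    return "\n".join(lines)


# The optimal stemmed tree of the running example, as read from the text.
tree = Split("CONTRACT",
             absent=Split("LIGHT",
                          absent=Split("SIGN",
                                       absent=Split("VIA",
                                                    absent=Leaf(0, 0.261725),
                                                    present=Leaf(1, 0.007834)),
                                       present=Leaf(1, 0.003917)),
                          present=Leaf(0, 0.031407)),
             present=Leaf(1, 0.027421))

outline = to_outline("097", tree, r_cv=0.23)  # r_cv is an assumed value
```

Rounding 1-Rt to two places reproduces the path weights 0.99, 1.00 and 0.97 of the example outline.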


7. TOPIC uses a representation for concepts that can be recorded
in so-called outline format. A collection of such specifications is
used by TOPIC to build the concept trees used for retrieval.


3.2  Second  Canonical  Form 

The second canonical form makes use of the decision
variables chosen by CART but not the actual decision
function. This model is based on two observations:

•  first,  that  the  set  of  variables  used  by  CART  are, 
when  taken  as  a  whole,  indicative  of  the  general  topic 
of  the  information  need  statement,  and 

•  second,  that  every  variable  used  in  the  tree  is  on  the 
path  to  at  least  one  class-0  node  and  at  least  one 
class-1  node. 

Thus, from an information retrieval perspective, all
the decision variables have some contribution to make to an
assessment of the relevance of a document. So rather than
use the specific decision function constructed by CART we
can replace this with one that can be thought of as "The
more features present in a document the better." This is
modelled straightforwardly in TOPIC using the ACCRUE
operator.

To  illustrate,  the  maximal  tree  for  our  example 
contains  the  following  set  of  decision  variables: 

CONCERN  CONTRACT  GLASS  LIGHT  SIGN  TRANSMIT 
VIA 

which  leads  directly  to  a  TOPIC  outline  of  the  form: 

Topic_097 <Or>

* 0.70 TopicStyle_097_2 <Accrue>

** 0.25 'CONCERN'
** 0.75 'CONTRACT'
** 0.75 'GLASS'
** 0.75 'LIGHT'
** 0.75 'SIGN'
** 0.75 'TRANSMIT'
** 0.75 'VIA'

The weighting scheme we have used gives higher values to
variables (features) that are in the optimal tree, intermediate
values to variables on the fringe of the optimal tree (there
are none of these in the current example), and lower values
to features outside the optimal tree.^8 At this point the
specific values chosen represent our "best guess" at a weighting
scheme; further experimentation will undoubtedly reveal a
better strategy. As in the first canonical form, the overall
weight for the TOPIC tree is based on the cross-validation
rate for the maximal tree.
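
The weighting rule can be sketched in Python as follows. This is a sketch under assumptions: the mapping of the three tiers to 0.75, 0.50 and 0.25 is inferred from the weights that appear in the example outlines, the k-value given for CONCERN and the duplicate TRANSMIT entry are hypothetical, and the function names are invented.

```python
def accrue_weights(decision_nodes, k_star, in_tree=0.75, fringe=0.50, outside=0.25):
    """Map each feature to a weight according to its highest k-value.

    Features with k > k_star are in the optimal tree, k == k_star are on
    the fringe, and k < k_star are outside it. The three default weights
    are inferred from the example outlines, not stated by the source.
    """
    highest_k = {}
    for feature, k in decision_nodes:
        highest_k[feature] = max(k, highest_k.get(feature, k))
    weights = {}
    for feature, k in highest_k.items():
        if k > k_star:
            weights[feature] = in_tree
        elif k == k_star:
            weights[feature] = fringe
        else:
            weights[feature] = outside
    return weights


def to_accrue_outline(topic, weights, overall):
    """Render the weights in a layout imitating the ACCRUE listing above."""
    lines = [f"Topic_{topic} <Or>",
             f"* {overall:.2f} TopicStyle_{topic}_2 <Accrue>"]
    for feature in sorted(weights):
        lines.append(f"** {weights[feature]:.2f} '{feature}'")
    return "\n".join(lines)


# (feature, k) pairs. CONCERN's k-value and the second TRANSMIT entry are
# hypothetical; the other k-values are those legible in the tree figures.
nodes = [("CONTRACT", 7), ("LIGHT", 6), ("SIGN", 5), ("VIA", 5),
         ("GLASS", 5), ("TRANSMIT", 5), ("TRANSMIT", 4), ("CONCERN", 3)]
weights = accrue_weights(nodes, k_star=4)
outline = to_accrue_outline("097", weights, overall=0.70)
```

Keeping the highest k-value for a repeated feature means that a word tested both inside and outside the optimal tree receives the higher weight.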


4  The  TREC-2  Experiments 

For TREC-2 we again focused only on the document
routing problem. Since our technique requires training
data it does not easily lend itself to the ad hoc retrieval
problem, and so rather than "force-fit" it we chose to
generate four sets of results for the routing queries (topics
51-100). Each set of results was generated totally
automatically. The result sets are labelled ads1, ads2, ads3, and
ads4, and the table below shows to which combinations of
features and TOPIC models they correspond.


Table 1: Results Identification

Result Set    Word Features    TOPIC Model
ads1          stemmed          model-1
ads2          unstemmed        model-1
ads3          stemmed          model-2
ads4          unstemmed        model-2

Although we generated four sets of results,
resource constraints at NIST resulted in only ads1 and ads2
being officially scored. References in the remainder of the
paper to scores associated with ads3 and ads4 are to the
unofficial scores generated by us using the TREC-2 scoring
program and the published qrels for the routing topics.

8. A variable is in the optimal tree if its k-value is greater than k*;
is on the fringe if k=k*; and is outside the optimal tree if k<k*.
Note that in general the individual features appear at multiple
locations in the tree. Our strategy is to remove duplicates by
retaining the instance with the highest k-value.


4.1  The Experimental Procedure

The experimental procedure for TREC-2 consists
of five basic steps. We briefly describe each of these:

•  First, we generated the CART training data from the
information need statements and the ground truth
files (i.e., the qrels) provided by NIST. This produced
two feature sets for each topic (corresponding to the
stemmed and unstemmed versions of the features),
and two sets of training vectors labelled with the
ground truth information.^9 Since CART is a
statistically-oriented classifier, we decided to minimize the
"noise" in the training sets by using only the Wall
Street Journal articles identified in the qrel files.
Further, for all but topics 80 and 81, we used just the
Wall Street Journal articles on Disk 2.

•  Second, we grew the CART trees from this training
data. Since we had two sets of training data for each
topic, we grew two trees for each topic.

•  Third, we used the algorithms described in Section 3
to convert the CART trees into a TOPIC-readable
form. This produced four TOPIC definitions for each
of the information need statements. Table 1 above
shows the various combinations.

•  Fourth, we ran the TOPIC definitions against the
indexed unseen data.^10 Again, to minimize noise
effects, we used only the Associated Press articles on
Disk 3 to generate our official results.

•  Fifth, we sorted and merged the results generated by
TOPIC and converted them into the TREC format for
scoring by NIST.
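
The fifth step can be sketched in Python as follows. This is a hedged illustration only: the merge rule (keep each document's best score), the limit of 1000 documents per topic, and the document identifiers are assumptions, and the six-column line layout is the one accepted by the trec_eval scoring program rather than a format confirmed by this paper.

```python
def merge_and_rank(result_sets, limit=1000):
    """Merge (docno, score) lists, keep each document's best score, rank by score."""
    best = {}
    for results in result_sets:
        for docno, score in results:
            if docno not in best or score > best[docno]:
                best[docno] = score
    # Highest score first; ties broken by document number for determinism.
    ranked = sorted(best.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]


def to_trec_lines(topic, ranked, run_tag):
    """Format ranked results as 'topic Q0 docno rank score run_tag' lines."""
    return [f"{topic} Q0 {docno} {rank} {score:.4f} {run_tag}"
            for rank, (docno, score) in enumerate(ranked, start=1)]


# Invented document identifiers and scores.
ranked = merge_and_rank([[("AP-A", 0.4), ("AP-B", 0.9)], [("AP-A", 0.7)]])
lines = to_trec_lines(52, ranked, "ads4")
```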

4.2  Discussion  of  Official  Results 

The official results for ads1 and ads2, together with
the unofficial results for ads3 and ads4, are shown in
Table 2.


Table 2: TREC-2 Results (AP Only)

Run ID    No. Retr.    No. Rel.    Rel. Ret.    Av. Prec.    Exact Prec.
ads1      40,423       5,677       822          0.0195       0.0390
ads2      33,034       5,677       1,468        0.0821       0.1092
ads3      49,006       5,677       1,182        0.0168       0.0374
ads4      50,000       5,677       1,847        0.0630       0.0868

The first observation is that the trees built using
exact words as features (i.e., results ads2 and ads4) had
higher precision than those built using word stems. We
believe that the explanation for this is that by using
stemmed versions of the features we added a significant
amount of "noise" to the sample space. That is, given the
relatively small size of the training sets, using stems tends to
reduce the discriminating power of any given feature with
respect to the training sets. This manifests itself indirectly in
two ways. First, the optimal trees built with stems are
generally smaller than those built using exact words; and, second,
optimal trees built with stems have higher cross-validation
error rates than those built using exact words.

9. The feature specification and extraction procedure we used is
identical to that used in TREC-1 and is described in detail in the
TREC-1 proceedings. The only differences are the addition of a
stemmed version of the features and the fact that we do not make
use of the feature count information.

10. We are grateful to Verity Inc. for allowing us to have access to
their computer systems and databases.

The second observation is that the TOPIC trees
built using model-2 (i.e., results ads3 and ads4) had better
recall than those built using model-1. The explanation for
this is straightforward. Since the model-2 trees essentially
use all the features extracted from the information need
statement in a generalized disjunction, they provide much
broader coverage than the model-1 trees, which often use
just one or two of the features.

The third observation is, of course, that these are
not strong results, since on average all the models performed
in the low end of the scores reported by NIST in the
summary routing table. This is somewhat disappointing since
our results in TREC-1 led us to believe that we might be
able to do significantly better.

Notwithstanding the fact that we did not explore
some of the ideas discussed in the TREC-1 paper (e.g., the
use of concepts rather than words as features, and the use of
surrogate split information), we are now inclined to the
view that the output from tools like CART is best used as
the basis for manually constructed routing topics. To begin
to explore this idea, we performed a number of auxiliary
tests that we report in the following section.

4.3  Auxiliary  Experiments 

To  explore  the  idea  of  using  the  CART  output  as 
the  "skeleton"  for  a  manually  constructed  routing  query,  we 
selected  two  model-2  trees  to  determine  whether  a  minimal 
set  of  "edits"  could  significantly  improve  their  performance. 
We  selected  Topics  52  and  54  since  they  represented  one 
topic  for  which  the  automatically  generated  tree  did  well 
(Topic  52)  and  one  for  which  the  automatically  generated 
tree  did  poorly  (Topic  54). 

The  scores  for  Topic  52  (for  the  AP  corpus  only) 
are  shown  below: 


11. We do note, however, that for a number of topics we did rather
well in comparison with other systems (i.e., Topics 51 and 75), and
that in absolute terms we produced a number of trees that had
greater than 30% R-precision (i.e., for Topics 52, 58, 78 and 93).


Queryid (Num): 52
Total number of documents over all queries
    Retrieved:  1000
    Relevant:    345
    Rel_ret:     328
Interpolated Recall - Precision Averages:
    at 0.00    0.6667
    at 0.10    0.4684
    at 0.20    0.4032
    at 0.30    0.4032
    at 0.40    0.4032
    at 0.50    0.4032
    at 0.60    0.4013
    at 0.70    0.4013
    at 0.80    0.4013
    at 0.90    0.4003
    at 1.00    0.0000
Average precision (non-interpolated) over all rel docs:
    0.3828
Precision:
    At    5 docs:  0.4000
    At   10 docs:  0.3000
    At   15 docs:  0.4667
    At   20 docs:  0.6000
    At   30 docs:  0.6333
    At  100 docs:  0.4100
    At  200 docs:  0.3550
    At  500 docs:  0.3820
    At 1000 docs:  0.3280
R-Precision (precision after R (= num_rel for a query) docs retrieved):
    Exact:  0.3739

Here we see that this topic tree does well: recall is
excellent and precision is sustained even at high recall levels.
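
For readers unfamiliar with the measures in this listing, the following Python sketch shows how the cutoff precision, R-precision and non-interpolated average precision figures are conventionally computed from a ranked list of relevance judgments. The function names and the five-document ranking are illustrative assumptions; this is not the official scoring program.

```python
def precision_at(ranking, cutoff):
    """Fraction of the first `cutoff` retrieved documents that are relevant."""
    return sum(ranking[:cutoff]) / cutoff


def r_precision(ranking, num_rel):
    """Precision after R documents retrieved, where R = num_rel."""
    return precision_at(ranking, num_rel)


def average_precision(ranking, num_rel):
    """Mean of the precision values at each relevant document, over all rel docs."""
    hits = 0
    total = 0.0
    for rank, is_relevant in enumerate(ranking, start=1):
        if is_relevant:
            hits += 1
            total += hits / rank
    return total / num_rel


# Invented ranking: 1 = relevant, 0 = non-relevant; 4 relevant documents exist.
ranking = [1, 0, 1, 1, 0]
p_at_5 = precision_at(ranking, 5)
r_prec = r_precision(ranking, 4)
avg_prec = average_precision(ranking, 4)
```

Relevant documents that are never retrieved contribute zero to the average, which is why the measure is divided by the total number of relevant documents rather than the number found.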

In  contrast,  the  topic  tree  for  Topic  54  produces  the 
following  results: 

Queryid (Num): 54
Total number of documents over all queries
    Retrieved:  1000
    Relevant:     65
    Rel_ret:      64
Interpolated Recall - Precision Averages:
    at 0.00    0.1323
    at 0.10    0.1323
    at 0.20    0.1323
    at 0.30    0.1323
    at 0.40    0.1323
    at 0.50    0.1323
    at 0.60    0.1323
    at 0.70    0.1102
    at 0.80    0.1102
    at 0.90    0.1097
    at 1.00    0.0000
Average precision (non-interpolated) over all rel docs:
    0.0907
Precision:
    At    5 docs:  0.0000
    At   10 docs:  0.0000
    At   15 docs:  0.0000
    At   20 docs:  0.0000
    At   30 docs:  0.0000
    At  100 docs:  0.0500
    At  200 docs:  0.0600
    At  500 docs:  0.1080
    At 1000 docs:  0.0640
R-Precision (precision after R (= num_rel for a query) docs retrieved):
    Exact:  0.0308

Thus  although  recall  is  very  good,  precision  is  completely 
unsatisfactory. 
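
The interpolated recall-precision averages in these listings can be understood through the following Python sketch, which uses the conventional definition: the interpolated precision at a recall level is the highest precision observed at any rank where recall is at least that level. The function name and the sample ranking are assumptions made for illustration.

```python
def interpolated_precision(ranking, num_rel, levels=11):
    """Interpolated precision at recall 0.0, 0.1, ..., 1.0."""
    points = []   # (relevant documents found so far, precision) at each rank
    hits = 0
    for rank, is_relevant in enumerate(ranking, start=1):
        hits += is_relevant
        points.append((hits, hits / rank))
    averages = []
    for i in range(levels):
        # recall >= i / (levels - 1), compared in integers to avoid float error
        reachable = [p for h, p in points if h * (levels - 1) >= i * num_rel]
        averages.append(max(reachable, default=0.0))
    return averages


# Invented ranking: 1 = relevant, 0 = non-relevant; 4 relevant documents exist.
curve = interpolated_precision([1, 0, 1, 1, 0], num_rel=4)
```

Under this definition the value at recall 1.00 is zero whenever at least one relevant document is never retrieved, which is consistent with the 0.0000 entries reported when Rel_ret is smaller than Relevant.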

Our conjecture is that the automatically constructed
model-2 trees, while generally giving good recall,
give poor precision because they contain many extraneous
features, or features that should be combined. To illustrate
this, we considered the model-2 tree for Topic 52 as a
starting point for a manually constructed tree. The initial
model-2 tree is:

Topic_52 <Or>

* 0.86 Topic_Style_52_2 <Accrue>

** 0.75 "AFRICA"
** 0.25 "AFRICAN"
** 0.75 "APARTHEID"
** 0.25 "ARMS"
** 0.25 "BAN"
** 0.25 "BLACK"
** 0.25 "COMPANY"
** 0.25 "COMPLIANCE"
** 0.25 "CONTRACTS"
** 0.25 "CORPORATE"
** 0.25 "DISCUSS"
** 0.25 "DOCUMENT"
** 0.50 "DOMINATION"
** 0.25 "GOVERNMENT"
** 0.25 "INTERNATIONAL"
** 0.25 "INVESTMENT"
** 0.25 "ORGANIZATION"
** 0.25 "PRESSURE"
** 0.75 "PRETORIA"
** 0.25 "REDUCTION"
** 0.25 "RESPONSE"
** 0.75 "SANCTIONS"
** 0.75 "SOUTH"
** 0.25 "TIES"
** 0.25 "TRADE"
** 0.25 "UNITED"

Obviously there are a number of features here that are
basically "noise" (for example, the words "COMPANY" and
"RESPONSE"), and other words are clearly elements of a
larger phrase (for example, the words "SOUTH" and
"AFRICA"). Notice that, in general, words with lower
scores are always candidates for elimination.

The result of this pruning exercise was the following
revised definition for Topic 52:

Topic_52 <Or>

* 0.86 TopicStyle_52-2 <Accrue>
** 0.50 S_Africa <Accrue>
*** 0.50 'SOUTH AFRICA'
*** 0.50 "PRETORIA"
** 0.50 'SANCTIONS'
** 0.20 Topic_52_Support <Accrue>
*** 0.50 "APARTHEID"
*** 0.50 <Near>
**** 'BAN'
**** 'TRADE'
*** 0.50 <Near>
**** 'BAN'
**** 'INVESTMENT'

So although we have added no new features, we have
combined "SOUTH" and "AFRICA" and used this together
with "PRETORIA" to define a concept called S_Africa.
We have also used "APARTHEID", and "BAN" with
"TRADE" and "INVESTMENT", to define another concept
called Topic_52_Support. Finally we adjusted the
weights to give more prominence to S_Africa than
Topic_52_Support.

The  results  for  this  modified  topic  description  are: 

Queryid (Num): 52
Total number of documents over all queries
    Retrieved:  1000
    Relevant:    345
    Rel_ret:     312
Interpolated Recall - Precision Averages:
    at 0.00    1.0000
    at 0.10    1.0000
    at 0.20    0.9780
    at 0.30    0.9766
    at 0.40    0.9603
    at 0.50    0.9067
    at 0.60    0.8620
    at 0.70    0.8620
    at 0.80    0.8620
    at 0.90    0.7422
    at 1.00    0.0000
Average precision (non-interpolated) over all rel docs:
    0.8305
Precision:
    At    5 docs:  1.0000
    At   10 docs:  1.0000
    At   15 docs:  1.0000
    At   20 docs:  1.0000
    At   30 docs:  1.0000
    At  100 docs:  0.9700
    At  200 docs:  0.8900
    At  500 docs:  0.6220
    At 1000 docs:  0.3120
R-Precision (precision after R (= num_rel for a query) docs retrieved):
    Exact:  0.8493

Thus, although recall decreased slightly (we now retrieve
312 rather than 328 of the 345 relevant documents),
precision improved by nearly 50 percentage points. This is
a significant improvement, and it was achieved with minimal
manual input: the changes took only about 10 minutes to
make.
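For readers unfamiliar with the measures in the listing above, the following is a minimal sketch of how they can be computed from a ranked list of relevance judgments. It follows the standard definitions rather than the actual evaluation program used for TREC-2; the function names and the toy ranking are assumptions made for illustration.

```python
from typing import List


def precision_at(ranked: List[bool], k: int) -> float:
    """Fraction of the first k retrieved documents that are relevant."""
    return sum(ranked[:k]) / k


def r_precision(ranked: List[bool], num_rel: int) -> float:
    """Precision after R = num_rel documents have been retrieved."""
    return precision_at(ranked, num_rel)


def average_precision(ranked: List[bool], num_rel: int) -> float:
    """Non-interpolated average precision over all relevant documents.

    Relevant documents that are never retrieved contribute zero.
    """
    hits, total = 0, 0.0
    for rank, is_rel in enumerate(ranked, start=1):
        if is_rel:
            hits += 1
            total += hits / rank
    return total / num_rel


def interpolated_precision(ranked: List[bool], num_rel: int, level: float) -> float:
    """Highest precision at any rank whose recall is at least `level`."""
    best, hits = 0.0, 0
    for rank, is_rel in enumerate(ranked, start=1):
        hits += is_rel
        if hits / num_rel >= level:
            best = max(best, hits / rank)
    return best


# Toy ranking: True marks a relevant document. Four relevant documents
# exist in the collection, of which three are retrieved.
toy = [True, False, True, True, False]

if __name__ == "__main__":
    print(precision_at(toy, 5), r_precision(toy, 4))
    print(average_precision(toy, 4), interpolated_precision(toy, 4, 1.0))
```

Because the toy run never retrieves its fourth relevant document, its interpolated precision at recall 1.00 is zero, which is the same reason the Topic 52 listing shows 0.0000 at that level. On the figures quoted above, recall moves from 328/345 (about 0.95) to 312/345 (about 0.90).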

We  repeated  this  exercise  with  Topic  54,  the  initial 
model-2  tree  for  which  is: 

Topic_54  <Or>

*     0.93  Topic_Path_54_2  <Accrue>
**     0.75  "ARIANE"
**     0.75  "ARIANESPACE"
**     0.75  "ATLAS"
**     0.75  "COMMERCIAL"
**     0.75  "CONTRACT"
**     0.75  "DELTA"
**     0.75  "DOCUMENT"
**     0.75  "DOUGLAS"
**     0.75  "DYNAMICS"
**     0.75  "INDUSTRY"
**     0.75  "LAUNCH"
**     0.75  "MARTIN"
**     0.75  "MENTION"
**     0.75  "PAYLOAD"
**     0.75  "PRELIMINARY"
**     0.75  "RELEVANT"
**     0.75  "RESERVATION"
**     0.75  "ROCKET"
**     0.75  "SATELLITE"
**     0.75  "SERVICES"
**     0.75  "SPACE"
**     0.75  "TENTATIVE"
**     0.50  "TITAN"

Using the same kinds of procedures (i.e., removing
extraneous words and combining words into phrases), we
constructed the following modified tree:

Topic_54  <Or>

*     0.93  Topic_Path_54_2  <Accrue>
**     0.10  'AGREEMENT'
**     0.75  "ARIANE"
**     0.75  "ARIANESPACE"
**     0.75  Atlas_Rocket  <Sentence>
***     "ATLAS"
***     "ROCKET"
**     0.75  Commercial_Satellite  <Sentence>
***     "COMMERCIAL"
***     "SATELLITE"
**     0.75  "DELTA II"
**     0.75  "MCDONNELL DOUGLAS"
**     0.75  "GENERAL DYNAMICS"
**     0.10  "LAUNCH"
**     0.75  "MARTIN MARIETTA"
**     0.75  "PAYLOAD"
**     0.75  "ROCKET"
**     0.75  "SATELLITE"
**     0.75  "SPACE"
**     0.50  "TITAN"
**     0.75  "CONTRACT"
**     0.75  "LAUNCH SERVICE"

Notice that here we actually added words that were part of
obvious proper names (i.e., the "GENERAL" of "GENERAL
DYNAMICS", the "MARIETTA" of "MARTIN MARIETTA", and the
"MCDONNELL" of "MCDONNELL DOUGLAS"), but otherwise nothing
was added. We also adjusted the weights on 'AGREEMENT' and
"LAUNCH" to de-emphasize their importance.

The results of running this modified query are:

Queryid  (Num):  54

Total number of documents over all queries

Retrieved:  1000

Relevant:  65

Rel_ret:  65
Interpolated Recall - Precision Averages:

at  0.00  0.5800
at  0.10  0.5800
at  0.20  0.5800
at  0.30  0.5800
at  0.40  0.5800
at  0.50  0.4592
at  0.60  0.4592
at  0.70  0.4324
at  0.80  0.3355
at  0.90  0.1474
at  1.00  0.0657
Average precision  (non-interpolated)  over
all rel docs:

0.3889


Precision :
At   15 docs:  0.3333
At   20 docs:  0.4000
At   30 docs:  0.4667
At  100 docs:  0.4500
At  200 docs:  0.2700
At  500 docs:  0.1260
At 1000 docs:  0.0650


R-Precision  (precision after R (= num_rel
for a query)  docs retrieved):
Exact:  0.4615

Thus, again with very little manual input, we achieved a
significant improvement in precision, this time at no cost
to recall.

It is interesting to compare our results with Verity's
scores for these two topics. To do this, we re-scored
Verity's TOPIC2 results on the AP corpus alone (see
footnote 12). For Topic 52, Verity's results were:

Queryid  (Num):  52

Total number of documents over all queries

Retrieved:  1000

Relevant:  345

Rel_ret:  317
Average precision  (non-interpolated)  over
all rel docs:

0.7425

Precision :
At 1000 docs:  0.3170

R-Precision  (precision after R (= num_rel
for a query)  docs retrieved):
Exact:  0.6812

Here we see better recall (317 of the 345 relevant
documents retrieved) but with slightly lower precision. The
TOPIC2 tree for this topic is much more complex than the
one we developed, which explains the better recall. Notice,
however, that both trees gave perfect precision for the
first 30 documents.

For Topic 54, Verity's TOPIC2 results were:

Queryid  (Num):  54

Total number of documents over all queries

Retrieved:  1000

Relevant:  65

Rel_ret:  65
Interpolated Recall - Precision Averages:

at  0.00  1.0000
at  0.10  0.9130
at  0.20  0.9130
at  0.30  0.9130
at  0.40  0.9000
at  0.50  0.7609
at  0.60  0.5942
at  0.70  0.5679
at  0.80  0.5049
at  0.90  0.2027
at  1.00  0.0927
Average precision  (non-interpolated)  over
all rel docs:

0.7159

Precision :
At   20 docs:  0.9000
At  200 docs:  0.2900

R-Precision  (precision after R (= num_rel
for a query)  docs retrieved):
Exact:  0.5846

This shows the same recall performance (i.e., all 65
relevant documents were retrieved) but substantially better
precision performance. Through the first 30 documents
TOPIC2 gave excellent results, whereas our modified model-2
result was only half as good. Again, however, the TOPIC2
tree is much more complex, and required more effort to
develop (see footnote 13).



Overall, we are impressed by the improved performance we
were able to achieve with minimal manual effort. These
auxiliary experiments provide at least suggestive evidence
of the value of automatic generation of initial trees. The
extent to which this is consistently achievable will
require further investigation, and we hope to report on
this in TREC-3.


5  Commentary 

The official results of our TREC-2 experiments
demonstrate that automatic construction of routing queries
from training documents is indeed feasible. The queries
produced are in fact binary classification trees that are
optimal with respect to size (measured in terms of the
number of terminals in the tree) and the estimated error
rate of the tree. Unfortunately, however, these trees
generally appear to have poor performance. In a few cases
the trees were comparable with the results from other
sites, but they mostly seemed unsatisfactory. We have not
been able to identify any specific conditions under which
we could expect the CART trees to perform well.


12.  We are grateful to Verity for allowing us to examine their
TREC-2 results in detail.


13.  We do not have precise figures for the amount of effort needed
to build the Verity TOPIC2 trees, but in general each topic required
several hours of effort.

We believe, nevertheless, that further experiments
to assess the utility of using a variety of extensions to
the basic CART technique would still be of interest. In
particular, the use of "low-level" concepts as features,
surrogate split information, larger training sets, and
features drawn from the complete corpus rather than the
information need statements, are all likely to assist in
the construction of more effective trees.

The main focus of our TREC-2 effort has been to
explore the idea that the CART trees can form the basis of
a semi-automated approach to building knowledge-based
descriptions of routing topics. To that end, we developed
two techniques for converting CART output into a form that
can be used by the TOPIC retrieval engine from Verity, Inc.
In the first technique we perform a "lossless"
transformation of the CART tree into a TOPIC tree. In the
second we generalize the CART tree, although we add no new
features.

As the examples contained in the paper show, the
TOPIC trees generated in the first canonical form are
"skeletal." This is just as we would expect: since CART is
a parsimonious classifier, rather than a broad-based
information retrieval tool, it produces minimally complex
decision trees with respect to the expected
misclassification rate.

From an information retrieval perspective, the
second canonical form seems more like the ones we might
have built by hand. It uses a range of features and gives
them weights based on their ability to discriminate among
the training data. In effect, we have used CART to indicate
which of the features are of use in defining the topic and
then generalized the CART decision function using our
(external) knowledge of the information retrieval problem.

Using these automatically constructed TOPIC trees
as a starting point, we conducted a limited series of tests
to assess the impact of performing minimal "editing" of
these trees. For the two topics selected, we were able to
produce significant performance gains with edits that added
no new information (at least at the level of the features
used) and that took only a few minutes to implement.

We are encouraged by these results, and, while the
generalization of them will require a more carefully
controlled series of experiments, we are now of the opinion
that the most effective role for machine learning
techniques in information retrieval is as a tool for
producing candidate descriptions of information need. These
candidates can then be reviewed by end-users who can easily
make obvious corrections and modifications. We intend to
explore this idea in more detail and report on our results
in TREC-3.

6  Future  Research 

There are a number of directions in which we might
develop the basic research ideas presented in this paper.
We briefly consider some of them here.

We currently use just two classes (relevant and not
relevant), but nothing in CART prevents it from working
with multiple classes. For the document routing problem,
there is a case to be made for adding a third class:
unknown relevance. Adding such a class might allow us to
make use of larger training sets without the costs
associated with developing large ground truths.

One way of extending the skeleton TOPIC trees
produced by our tool is to make use of external lexical
resources. For example, we might investigate the use of
WordNet as a way to expand each of the classification
features into a set of related words. Similarly, we might
investigate the use of TOPIC's own lexical resources (e.g.,
the thesaurus and Soundex tools) by replacing the unstemmed
words in the topic outline files with the appropriate TOPIC
operator (e.g., <SOUNDEX> word, or <THESAURUS> word).
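The substitution described above is a simple text rewrite, sketched below under stated assumptions: the outline line format is taken from the listings in this paper, the function name `wrap_with_operator` is hypothetical, and only single quoted words are rewritten (multi-word phrases are left alone). It is not a description of any TOPIC utility.

```python
import re


def wrap_with_operator(outline_line: str, operator: str) -> str:
    """Rewrite each double-quoted single word as `<OPERATOR> word`.

    Quoted multi-word phrases do not match the pattern and are kept as-is.
    """
    pattern = r'"\s*([A-Za-z]+)\s*"'
    return re.sub(pattern, lambda m: f"<{operator}> {m.group(1)}", outline_line)


if __name__ == "__main__":
    print(wrap_with_operator('**  0.75  "ROCKET"', "SOUNDEX"))
    print(wrap_with_operator('**  0.75  "MARTIN MARIETTA"', "THESAURUS"))
```

Whether a given feature is better served by a phonetic or a thesaurus expansion would presumably still be an editorial decision made per word.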

Although we have used CART as the module for
building the initial classification tree, we might be
interested in exploring other tree building tools that have
been used in the machine learning community. Examples
include the C4.5 algorithm by Quinlan [6] and various
algorithms based on Bayesian methods, such as Minimum
Message Length models and decision graphs [7]. All of these
tools generate decision trees from training data but offer
different mathematical philosophies to justify their
approaches.

Finally, as in all machine learning problems, the
initial choice of features over which to learn is critical
to the overall success of the process. An investigation of
various extended feature definition tools (e.g.,
recognizing key phrases and proper names), as well as an
exploration of the impact of making different assumptions
from TOPIC about how the lexical tokens in the texts are to
be treated, would almost certainly yield important
insights.

Introduction

ConQuest Software has a commercially available text search
and retrieval system* called "ConQuest" (for Concept
Quest). ConQuest is primarily an advanced statistically
based search system, with processing enhancements drawn
from the field of Natural Language Processing (NLP).

ConQuest  participated  in  Category  A  of  TREC,  and  so 
produced  results  for  50  test  queries  over  the  entire  2.3 
Gigabyte  database.  In  this  category,  we  constructed  queries 
and  submitted  results  for  two  different  ranking  functions. 
These  two  functions  tested  the  difference  between  local  and 
global  document  relevancy,  and  are  fully  described  later. 

In TREC-2, ConQuest had a very strong showing. Our
recall scores in particular improved by about 18 percentage
points over the adjusted TREC-1 scores. Our precision
scores were also very competitive.

The  purpose  of  this  paper  is  to  discuss  how  we  prepared  for 
TREC-2:  how  queries  were  performed,  what  initial 
judgments  were  made  and  why,  and  interpretation  of  the 
results.  Then,  I  will  cover  the  tests  which  were  performed 
after  TREC-2,  and  how  these  tests  clearly  identify  the  areas 
where  ConQuest  could  most  effectively  be  improved. 

System  Architecture 

For a complete discussion of the system architecture of
ConQuest, see the TREC-1 conference proceedings, or call
the author. The following overview is meant as a brief
refresher.

ConQuest  uses  pre-built  indexes  to  perform  text  database 
searches  at  fast  speeds.  In  such  a  system,  all  text  to  be 
searched  must  first  be  indexed.  These  indexes  are  then  used 
for  all  searching;  the  original  document  data  is  not  required. 

ConQuest  uses  a  dictionary  augmented  with  a  semantic 
network  for  both  indexing  and  queries.  The  dictionary  is  a 
list  of  words  where  each  word  contains  multiple  meanings. 
Each  meaning  contains  syntactic  information  (part-of- 
speech,  feature  values),  and  a  dictionary  definition. 


The  semantic  network  contains  nodes  which  correspond  to 
meanings  of  words.  These  nodes  are  linked  to  other  related 
nodes.  Relationships  between  nodes  are  extracted  from 
machine  readable  dictionaries.  Some  example  relationship 
types  include  synonym,  antonym,  child-of,  parent-of, 
related-to,  part-of,  substance-of,  contrasting,  and  similar-to. 

The ConQuest dictionary was generated automatically from
several commercially available machine-readable dictionary
(MRD) sources. This gives ConQuest the most robust and
thorough coverage of English available. It is the
completeness of coverage that drives performance gains in
recall and precision.

Since  ConQuest  is  a  commercially  available  product,  many 
additional  components,  not  required  for  TREC-2,  are  also 
available,  such  as  true  client/server,  graphical  user 
interfaces,  routing  and  dissemination,  and  sophisticated 
application  program  interfaces. 

Query 

Generally  speaking,  ConQuest  attempts  to  refine  and 
enhance  the  user's  query.  The  result  is  then  matched  against 
the  indexes  to  look  for  documents  which  contain  similar 
concepts. 

Queries  are  not  "understood"  in  the  traditional  sense  of 
natural  language  processing.  ConQuest  makes  no  attempt 
to  deeply  understand  the  objects  in  the  query,  their 
interaction,  or  the  user's  intent.  Rather,  ConQuest  attempts 
to  understand  the  meaning  of  each  individual  word  and  the 
importance  of  the  word.  It  then  uses  the  set  of  meanings 
and  their  related  terms  (retrieved  from  the  semantic 
networks)  as  a  statistical  set  which  is  matched  against 
document  information  stored  in  the  indexes. 


* For additional information on ConQuest, please contact
the author.

Figure 1  The Query Process

The  following  is  a  description  of  the  modules  used  for 
query: 

•  Tokenize:  Divides  a  string  of  characters  into  words. 

•  Morphology:  An  advanced  form  of  stemming; 
attempts  to  remove  suffixes  and  perform  spelling 
changes  to  reduce  words  to  simpler  forms  which  are 
found  in  the  dictionary.  For  example,  one  morphology 
rule  will  take  "babies,"  strip  the  "ies,"  add  "y,"  and 
produce  "baby,"  which  is  found  in  the  dictionary. 

•  Find  Idioms:  This  module  finds  idioms  in  the  text 
and  indexes  the  idiom  as  a  single  unit.  This  prevents 
idioms  such  as  "Dow  Jones  Industrial  Average"  from 
getting  confused  with  queries  on  "industrial  history." 
Words  inside  of  idioms  can  still  be  located 
individually,  if  desired. 

•  Query  Enhancement:  The  user  is  given  the 
opportunity  to  enhance  the  query  for  additional 
improvement  in  precision  and  recall.  There  are  many 
options  available  here,  but  the  two  most  important  are 
to  choose  meanings  and  weight  query  terms. 
Choosing  a  meaning  of  a  word  will  restrict  the 
expansion  of  words  to  only  related  terms  which  are 
relevant  to  the  chosen  meanings.  This  reduces  noise  in 
the  query.  When  running  in  automatic  mode, 
ConQuest  expands  all  meanings  of  all  words. 

Weighting  query  terms  identifies  the  importance  of  the 
various  words  in  the  query.  These  weights  are  used  by 
the  search  engine  when  ranking  documents  and 
computing  document  relevance  factors. 

•  Remove  Stop  Words:  Small  function  words — such  as 
determiners,  conjunctions,  auxiliary  verbs,  and  small 
adverbs — are  removed  from  the  query. 

•  Expand  Meanings:  Words  in  the  query  are  expanded  to 
include  related  terms. 


•   Search  and  Rank:  ConQuest  uses  an  integrated  search 
and  rank  algorithm  (described  in  the  next  section) 
which  considers  the  relevance  rankings  of  documents 
throughout  the  search  process.  Since  ranking  and 
search  are  integrated,  the  search  engine  automatically 
produces  the  most  relevant  documents  right  away. 
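The first few modules in the list above can be sketched in a few lines. This is an illustrative toy, not ConQuest's implementation: the dictionary, idiom list, and stop-word list are tiny hypothetical stand-ins, the suffix rules cover only the "babies" example and a plain plural, and query enhancement, expansion, and search are omitted.

```python
import re
from typing import List

# Hypothetical miniature resources standing in for the real dictionary,
# idiom list, and stop-word list.
DICTIONARY = {"baby", "industrial", "history", "average"}
IDIOMS = {("dow", "jones", "industrial", "average")}
STOP_WORDS = {"the", "of", "and", "a", "is", "are", "in"}


def tokenize(text: str) -> List[str]:
    """Divide a string of characters into lower-cased word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def morphology(word: str) -> str:
    """Try suffix rules; accept a reduced form only if the dictionary has it."""
    if word in DICTIONARY:
        return word
    candidates = []
    if word.endswith("ies"):
        candidates.append(word[:-3] + "y")   # babies -> baby
    if word.endswith("s"):
        candidates.append(word[:-1])         # plain plural
    for candidate in candidates:
        if candidate in DICTIONARY:
            return candidate
    return word


def find_idioms(tokens: List[str]) -> List[str]:
    """Merge any known idiom into a single unit, scanning left to right."""
    out, i = [], 0
    while i < len(tokens):
        for idiom in IDIOMS:
            if tuple(tokens[i:i + len(idiom)]) == idiom:
                out.append(" ".join(idiom))
                i += len(idiom)
                break
        else:
            out.append(tokens[i])
            i += 1
    return out


def prepare_query(text: str) -> List[str]:
    """Tokenize, reduce words, merge idioms, then drop stop words."""
    tokens = find_idioms([morphology(t) for t in tokenize(text)])
    return [t for t in tokens if t not in STOP_WORDS]


if __name__ == "__main__":
    print(prepare_query("The babies of the Dow Jones Industrial Average"))
```

Checking each candidate form against the dictionary is what distinguishes this from blind suffix stripping: "jones" is left alone because "jone" is not a dictionary word.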

Queries  can  be  expanded  to  a  very  large  number  of  terms,  if 
desired.  If  the  user  wishes  for  the  greatest  amount  of  recall, 
a  5  word  query  can  be  expanded  to  200  or  300  related  terms. 

Many other query features are also available in ConQuest,
including wildcards, fuzzy spelling expansion, numeric and
date range searching, boolean, mixed boolean and statistical,
fielded searching (a variety of types), and searching over
document categories.

Ranking  Factors 

Ranking  and  retrieval  with  ConQuest  uses  a  variety  of 
statistics  and  criteria,  which  are  flexible  and  can  be  modified 
to  handle  varying  requirements.  The  following  are  some  of 
the  factors  used  in  ranking: 

Completeness:  A  good  document  should  contain  at 
least  one  term  or  related  term  for  each  word  in  the 
original  query. 

Contextual  Evidence:  Words  are  supported  by  their 
related  terms.  If  a  document  contains  a  word  and  its 
related  terms,  then  the  word  is  given  a  higher  weight 
because  it  is  surrounded  by  supporting  evidence. 

Semantic  Distance:  The  semantic  network  contains 
information  on  how  closely  two  terms  are  related. 

Proximity:  A  document  is  considered  to  be  more 
relevant  if  it  contains  matching  terms  which  occur 
close  together,  preferably  in  the  same  paragraph  or 
sentence. 

Quantity:  The  absolute  quantity  of  hits  in  the 
document  is  also  included,  but  is  not  as  strong  a 
discriminator  of  relevance  as  the  other  factors. 

ConQuest is the first truly "concept-based" search system to
operate over unrestricted domains. If a document contains
a word and some of its related terms, the word is more
likely to be used in the correct context; this is the
"contextual evidence" factor above. In this way, ConQuest
can determine word meanings at query time.

Coarse  and  Fine  Grain  Ranking 

To  further  improve  retrieval  speed,  ConQuest  performs  the 
search  in  two  phases.  The  first  is  "coarse-grain."  This  phase 
is  integrated  with  the  document  search  process.  Documents 
are  output  from  the  ConQuest  search  engine  in  descending 
coarse-grain  rank  order. 

To compute the coarse-grain rank for a document, the
statistics for the words contained in the document are
combined using the coarse-grain ranking function. The
inputs to this function include the semantic network
strength of each word, frequency in query, expansion terms,
inverse document frequency, and query structure.

Once a document is found using coarse-grain rank, a second
phase of relevancy ranking is applied, called "fine-grain"
rank. This second phase uses a different ranking function
which has access to more local information within the
document. The inputs to this function include all of the
inputs used in coarse-grain ranking, plus word location,
proximity, frequency in document, and document structure.

Figure 2  Fine and Coarse Grain Ranking

In  general,  the  coarse-grain  rank  of  a  document  represents 
global  information  on  the  document.  It  is  a  score  that 
applies  more  to  the  document  as  a  whole.  The  coarse-grain 
rank  will  be  high  for  a  document  if  it  contains  a  large 
number  of  query  words  and  related  terms,  ignoring  the 
position  of  those  terms  in  the  document. 

The  fine-grain  rank,  on  the  other  hand,  represents  local 
information,  because  the  proximity  (physical  closeness)  of 
the  terms  is  the  strongest  contributor.  The  fine-grain  rank 
of  a  document  will  be  high  if  there  is  a  single  strong 
reference  contained  in  the  document. 

As  shown  in  Figure  2  above,  the  final  document  score  is 
computed  as  a  combination  of  the  coarse-  and  fine-grain 
scores. 
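The two-phase scheme can be sketched as follows. This is a minimal illustration under assumptions: the actual ranking functions and the way ConQuest combines the two scores are not given here, so the scoring callables, the `coarse_cutoff` and `fine_weight` parameters, and the use of a simple weighted average are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple


def two_phase_rank(
    docs: Dict[str, dict],
    coarse: Callable[[dict], float],
    fine: Callable[[dict], float],
    coarse_cutoff: int,
    fine_weight: float,
) -> List[Tuple[str, float]]:
    """Shortlist by coarse-grain score, then re-rank by a combined score."""
    # Phase 1: keep only the best `coarse_cutoff` documents by coarse score.
    by_coarse = sorted(docs, key=lambda d: coarse(docs[d]), reverse=True)
    shortlist = by_coarse[:coarse_cutoff]
    # Phase 2: weighted average of fine- and coarse-grain scores.
    scored = [
        (d, fine_weight * fine(docs[d]) + (1.0 - fine_weight) * coarse(docs[d]))
        for d in shortlist
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy collection with precomputed scores (illustrative values only).
toy_docs = {
    "d1": {"coarse": 0.9, "fine": 0.2},
    "d2": {"coarse": 0.6, "fine": 0.9},
    "d3": {"coarse": 0.1, "fine": 1.0},
}

if __name__ == "__main__":
    ranking = two_phase_rank(
        toy_docs, lambda d: d["coarse"], lambda d: d["fine"],
        coarse_cutoff=2, fine_weight=0.5,
    )
    print(ranking)
```

In the toy data, d3 has the best fine-grain score but never reaches the second phase, which shows why the recall of the coarse-grain step bounds the recall of the whole system. Setting `fine_weight` to 1.0 reduces the second phase to a pure fine-grain sort of the shortlist.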

Pre-TREC  Experiments 

In  preparation  for  TREC-2,  ConQuest  performed  numerous 
experiments  to  improve  the  coarse-grain  ranking  algorithms 
and  data.  These  experiments  included  the  following: 

1.  Statistical word studies (statistical regressions to
predict the probability that a document containing a
word is relevant)

2.  Statistical  word-pair  studies 

3.  Various  weighting  formulae 

4.  Various  query  structuring  techniques 


These  studies  were  all  performed  under  the  assumption  that 
the  coarse-grain  ranking  formula  used  for  TREC  was  weaker 
than  the  fine-grain  ranking  formula.  The  concern  was  that 
coarse-grain  ranking  did  not  retrieve  a  large  enough 
percentage  of  relevant  documents  in  the  initial  retrieval  set. 
It  was  thought  that  once  these  documents  were  retrieved,  the 
fine-grain  algorithms  would  effectively  use  proximity  and 
term  frequency  information  to  sort  the  documents  and  put 
all  of  the  truly  relevant  ones  at  the  top  of  the  list. 

Unlike  other  systems,  ConQuest  did  not  have  funding  for 
these  TREC  studies.  This  put  the  TREC  studies  in  direct 
conflict  with  other  more  pressing  concerns,  such  as 
supporting  customers,  or  providing  new  functionality  such 
as  client/server. 

As  a  result,  the  testing  from  these  early  studies  proved 
ambiguous  and  unreliable.  We  believe  that  this  was  due  to 
the  following: 

•  Since  time  and  resources  were  limited,  tests  were 
performed  on  only  a  small  number  of  queries  (5-10). 
This  did  not  provide  a  large  enough  sample  set  of 
queries  to  produce  reliable  test  results. 

•  ConQuest  never  tested  the  original  assumption  that 
coarse-grain  was  the  limiting  step  in  improving 
accuracy. 

•  The  queries  for  this  testing  were  taken  from  the 
TREC-1  final  test  queries.  However,  many  of  these 
queries  were  hastily  constructed  and  thus  added  noise 
to  the  test  results. 

Just  before  the  TREC-2  results  were  due,  ConQuest  decided 
to  concentrate  most  of  its  effort  on  improving  the  tools 
used  to  generate  queries.  The  tools  and  processes  created  are 
described  in  the  next  section. 

Generating  Queries  for  TREC-2 

Generating queries was primarily an automatic process,
based on the initial TREC-2 topic descriptions. Manual
input was used primarily to remove items: words, word
meanings, and expansions. This produced queries with only
the terms that are relevant. If needed, a user can also set
weights for query terms.

Note  that  all  manual  steps  were  performed  for  all  queries 
before  any  documents  were  retrieved.  In  other  words,  no 
feedback  information  was  used  in  generating  the  queries. 
This  makes  ConQuest  fully  compliant  with  the  rules  for  ad- 
hoc  queries  in  TREC-2. 

Automatic  Query  Generation  Steps 

A special program was created to convert TREC-2 topic
descriptions into ConQuest query log files. The architecture
of this program is shown in Figure 3.

Figure 3  Program to Automatically Generate Query Log Files

The  modules  in  the  program  are  as  follows: 

Parse  Topic  -  Reads  through  the  topic  looking  for  the 
SGML  codes  (such  as  <description>).  The  location 
within  the  topic  for  all  words  in  the  query  are  preserved 
in  the  final  query  log  files. 

Tokenize  -  Divides  up  strings  into  tokens. 

Morphology  -  Locates  all  words  in  the  dictionary  and 
reduces  them  to  root  words  if  possible. 

Idiom  Processing  -  Collects  idioms  together  as  single 
terms,  such  as  "United  States." 

Remove  Stop  Words  -  Removes  conjunctions, 
determiners,  auxiliary  verbs,  prepositions,  etc. 

Remove  Function  Words  -  Removes  words  such  as 
"document,"  "relevant,"  and  "retrieve"  which  are  used 
often  in  TREC-2  narratives  but  do  not  help  retrieval. 

Expand  Word  Meanings  -  All  word  meanings  are 
expanded  using  the  ConQuest  semantic  network  and  all 
expansions  are  added  to  the  query. 

Note  that  all  of  these  steps  occur  automatically  with  no 
manual  input.  The  program  also  generates  other  statistics, 
such  as  the  count  of  each  term  in  the  query,  a  count  for  each 
term  for  each  section  of  the  query  (sections  being  the  topic, 
description,  narrative,  concepts,  and  factors),  and  the  total 
number  of  words  in  the  query. 
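The per-section statistics mentioned above are straightforward to gather once the topic has been split at its markup. The sketch below is an assumption-laden illustration: the tag names simply reuse the section names listed in the text and may not match the real topic markup, and the tokenizer is a plain letters-only pattern rather than ConQuest's.

```python
import re
from collections import Counter
from typing import Dict

# Hypothetical tag names, taken from the section names listed in the text.
SECTION_TAGS = {"topic", "description", "narrative", "concepts", "factors"}


def section_term_counts(topic_text: str) -> Dict[str, Counter]:
    """Count lower-cased word tokens separately for each tagged section."""
    counts: Dict[str, Counter] = {}
    # Splitting on a captured tag yields [preamble, tag1, body1, tag2, body2, ...].
    pieces = re.split(r"<(\w+)>", topic_text)
    for tag, body in zip(pieces[1::2], pieces[2::2]):
        tag = tag.lower()
        if tag in SECTION_TAGS:
            words = re.findall(r"[a-z]+", body.lower())
            counts.setdefault(tag, Counter()).update(words)
    return counts


def query_totals(counts: Dict[str, Counter]) -> Counter:
    """Count of each term over the whole query, summed across sections."""
    total: Counter = Counter()
    for section_counts in counts.values():
        total.update(section_counts)
    return total


sample_topic = (
    "<topic> Satellite launch contracts "
    "<description> Document will cite a contract to launch a satellite "
    "<narrative> A relevant document mentions a launch contract"
)

if __name__ == "__main__":
    per_section = section_term_counts(sample_topic)
    totals = query_totals(per_section)
    print(per_section["description"]["satellite"], totals["launch"], sum(totals.values()))
```

Keeping the section with each count preserves the location information that the Parse Topic module is said to carry through to the query log files.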

Manual  Query  Generation  Steps 

There  were  two  manual  steps  used  to  generate  queries: 

1.  Remove words, word meanings, and/or expansions

2.  Set  term  weights  (if  necessary) 

Fortunately,  ConQuest  has  graphical  user  interfaces  (GUIs) 
for  removing  words,  word  meanings,  and  expansions  from 
the  queries  automatically  generated.  A  user  merely  brings 
up  the  query  and  uses  the  mouse  to  select  items  to  be 
deleted. 

In  TREC-2,  terms  were  not  weighted  in  the  traditional 
sense,  but  rather  were  categorized  into  three  sets: 

1.  Terms that embody the entire query, which would
make good search terms if used by themselves

2.  Terms  which  embody  a  necessary  portion  of  the  query, 
but  not  the  entire  concept 

3.  All  other  related  terms 


These  categories  provide  simple  guidelines  for  setting  term 
weights,  which  make  it  much  easier  to  generate  queries. 
Evaluations  using  the  TREC-2  test  topics  determined  the 
functions  for  the  actual  term  weights. 

To  emphasize  once  more,  no  document  feedback  was  used 
for  these  manual  steps.  All  query  adjustments  were 
performed  without  executing  any  query.  Only  after  all 
queries  were  generated  were  the  final  results  generated. 

The  TREC-2  Results 

ConQuest  scored  very  well  in  TREC-2.  In  particular,  our 
recall  percentages  were  quite  high.  Our  average  precision 
scores  were  not  as  good,  but  still  competitive. 

ConQuest  submitted  two  sets  of  results  for  TREC-2, 
CnQstl  and  CnQst2.  Both  sets  used  the  same  coarse-grain 
algorithm  which  retrieved  the  best  5000  documents  from 
the  database.  The  difference  between  the  two  results  was 
how  these  5000  documents  were  sorted  to  derive  the  top 
1000  documents  which  were  used  for  the  official  results. 

The  first  set  (CnQstl)  used  fine-grain  as  the  only  sorting 
algorithm.  This  algorithm  primarily  depends  on  local 
proximity  information,  although  word  statistics  and  query 
structure  are  also  incorporated. 

The second set of results (CnQst2) was a weighted average
of the fine-grain and coarse-grain statistics for each
document. As it turned out, this combination of local
(fine-grain) and global (coarse-grain) statistics provided
significantly better results.

The  relatively  modest  addition  of  global  information 
improved  the  results  more  than  expected.  Previous 
experience  had  always  indicated  that  fine-grain  information, 
especially  the  proximity  test,  was  the  strongest  contributor 
to  document  relevancy. 

Some  additional  insights  can  be  extracted  from  topic 
analyses  presented  at  the  TREC-2  conference.  Specifically, 
the  topics  where  ConQuest  excelled  over  other  systems 
were  also  those  which  tended  to  have  fewer  relevant 
documents  in  the  database.  This  indicates  that  local 
proximity  statistics  (used  by  ConQuest)  are  more  important 
for  these  queries,  since  most  other  systems  in  TREC-2  are 
heavily  weighted  towards  global  document  statistics.  In 
other  words,  ConQuest  appears  to  perform  better  for  queries 
where  one  needs  to  find  the  "needle  in  the  haystack." 

Post  TREC  Analysis 

After  TREC-2,  we  had  the  chance  to  clean  up  our  initial 
tests,  gather  new  statistics,  and  perform  some  additional 
analysis. 

The  first  step  in  this  process  was  to  prove  the  accuracy  of 
the  coarse-grain  algorithm.  Remember  that  initial  tests 
attempted  to  improve  the  coarse-grain  algorithm.  But  did 
the  coarse-grain  algorithm  really  need  improvement?  One 
indication  that  coarse-grain  was  accurate  was  provided  by 
the  CnQst2  run,  which  performed  better  than  expected. 
To check the coarse-grain rank, we constructed graphs
which show its performance more clearly. Since fine-grain
can only work on the results of the coarse-grain algorithm,
what is the loss in recall for coarse-grain?

The  following  graph  shows  the  cumulative  recall  percentage 
as  documents  are  retrieved  from  coarse-grain  rank.  Every 
time  a  relevant  document  is  retrieved,  the  recall  percentage 
gradually  inches  up  towards  100%.  Note:  these  tests  were 
run  on  just  the  Category  B  data. 
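
A minimal sketch of how such a cumulative recall curve can be computed is shown below. The function name and the input layout are assumptions made for illustration, and each document id is assumed to appear only once in the ranked list.

```python
def cumulative_recall(ranked_ids, relevant_ids):
    """Return the recall percentage after each retrieved document.

    ranked_ids: document ids in coarse-grain rank order (no duplicates assumed).
    relevant_ids: ids of all documents judged relevant for the query.
    """
    relevant = set(relevant_ids)
    if not relevant:
        return []
    found = 0
    curve = []
    for doc in ranked_ids:
        if doc in relevant:
            found += 1
        curve.append(100.0 * found / len(relevant))
    return curve
```

Plotting the returned list against the rank position gives a curve of the kind discussed in this section.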
Figure  4  Cumulative  Recall  Percentage 
for  Query  #110 

Figure  4  shows  two  exciting  discoveries.  The  first  is  that 
the  coarse-grain  performance  achieves  over  95%  recall.  This 
strongly  contradicts  our  initial  fears  that  coarse-grain  was 
not  retrieving  enough  relevant  documents. 

The  second  discovery  is  that  the  high  recall  figures  are 
achieved  quickly.  This  implies  that  ConQuest  can  retrieve 
fewer  documents  (greatly  improving  speed)  and  still  achieve 
high  accuracy. 

To  further  establish  these  claims,  we  repeated  the  analysis 
on  all  queries  in  the  TREC-2  topic  set,  then  averaged  the 
results  together,  as  shown  in  the  next  graph: 
Figure  5  Cumulative  Recall  as  Documents  are  Retrieved 
using  Coarse-Grain  Rank 

This  figure  is  an  average  over  all  queries.  The  average 
strongly  correlates  with  the  results  from  query  #110.  This 
verifies  the  two  discoveries  identified  above. 

Some  initial  studies  also  more  clearly  show  the  difference 
between  fine-grain  and  coarse-grain  sorting  of  documents. 
The  following  figure  shows  both  graphs  superimposed: 
Figure  6  Coarse-Grain  Sorting  vs.  Fine-Grain  Sorting 
for  TREC-2  Topic  #135 


In  this  diagram,  we  see  that  fine-grain  sorting  is  in  fact 
better  than  coarse-grain.  In  other  queries,  the  results  are 
more  mixed.  Clearly,  the  difference  is  not  as  great  as  was 
initially  assumed. 

This  suggests  that  the  area  where  ConQuest  can  most 
improve  is  not  in  the  coarse-grain  ranking  algorithm,  but 
rather  in  improving  the  fine-grain  algorithm,  or  providing  a 
better  combination  of  the  two. 

Upon  further  study,  we  believe  we  now  know  why.  When 
the  fine-grain  algorithm  was  developed,  the  programmers 
assumed  an  average  query  length  of  about  5  words.  Studies 
of  typical  users  indicate  that  their  preferred  query  type  is  a 
simple  two  or  three  word  phrase.  Therefore,  a  fine-grain 
rank  tuned  for  short  queries  has  provided  the  best  and  most 
accurate  system  for  our  commercial  users. 

In  TREC-2,  however,  the  queries  are  often  40-50  words 
long.  These  analyses  have  shown  us  that  our  initial  fine- 
grain  algorithms  are  not  as  accurate  for  such  long  queries. 

Future  Analysis 

Our  tests  indicate  that  more  study  of  other  fine-grain 
algorithms  is  where  ConQuest  can  most  likely  improve  its 
scores  for  TREC-3.  New  ways  of  looking  at  proximity  and 
positional  information  in  the  document  will  be  explored  and 
compared  against  the  existing  coarse-grain  ranking  results. 

We  still  feel  that  our  existing  fine-grain  algorithm  is  best 
for  the  typical  commercial  user,  and  we  are  looking  for 
ways  to  fully  test  this  hypothesis. 

Finally,  we  are  now  much  more  sensitive  to  the  effects  of 
query  size  on  fine-grain  algorithms  and  are  looking  more 
closely  at  ways  to  desensitize  our  fine-grain  algorithms,  or 
to  adapt  them  easily  to  different  query  lengths. 
Abstract 


This  paper  describes  an  application  of  the 
Combination  of  Expert  Opinion  technique  to 
combine  the  results  of  multiple  retrieval 
methods  used  on  the  TREC-2  collection. 
The  methods  being  combined  were  weighted 
by  their  TREC-1  performance. 

1.  Introduction 

This  paper  describes  work  done  on  the 
TREC-2  project  at  PRC  Inc.  in  collaboration 
with  Professor  Edward  Fox  and  his 
colleagues  at  Virginia  Polytechnic  Institute 
and  State  University  (VPI&SU).  The  reader 
should  refer  to  the  description  of  their 
system  included  in  these  working  notes  for 
further  details  on  the  common  processing  of 
the  TREC-2  data  shared  by  PRC  and 
VPI&SU  (Fox  et  al.  1993).  PRC  used  its 
algorithm,  the  Combination  of  Expert 
Opinion  (CEO),  to  combine  the  results  of 
VPI&SU's  runs.  VPI&SU  used  a  different 
combination  technique  for  their  final  results. 
Originally  the  intent  was  that  the  CEO 
algorithm  would  be  integrated  with  the 
SMART  system  used  by  VPI&SU.  Both 
upper  and  lower  level  combination  of  results 
would  take  place,  i.e.,  at  the  lower  level  of 
individual  document  features  within  a 
particular  retrieval  method  and  the  upper 
level  of  combination  of  the  output  of  the 
individual  methods  themselves,  i.e.,  the 
various  cosine  and  p-norm  methods  used  by 
VPI&SU.  For  TREC-1  we  were  not  able  to 
train  the  CEO  algorithm,  so  that  the 
weighting  of  the  various  methods  would  be 
optimized  based  on  relevance  judgments. 
For  TREC-2  we  used  the  11  point  average 
scores  obtained  by  the  various  methods  for 
TREC-1  for  weighting.  Again  we  only  used 
the  upper  level  of  CEO.  For  TREC-1  we 
found  that  combining  all  methods  resulted  in 
lower  performance  than  using  the  single  best 
method.  This  year  our  first  version 
combined  the  top  two  methods,  based  on 
TREC-1,  while  the  second  version  used  the 
top  five  methods. 

2.  Combination  of  Expert  Opinion 

The  statistical  technique  of  CEO  provides  a 
solution  to  the  problem  of  combining 
different  probabilistic  models  of  document 
retrieval.  This  technique  is  expected  to 
result  in  improved  precision  and  recall  over 
that  provided  by  any  one  model,  or  method, 
since  research  has  shown  that  various 
retrieval  models  retrieve  different  sets  of 
more  or  less  equally  relevant  documents 
(Katzer  et  al.  1982,  Fox  et  al.  1988).  In  the 
Bayesian  formulation  of  the  CEO  problem 
(Lindley  1983)  a  decision  maker  is  interested 
in  some  parameter  or  event  for  which  he/she 
has  a  prior,  or  initial,  distribution  or 
probability.  The  decision  maker  revises  the 
distribution  upon  consulting  several  experts, 
each  with  his/her  own  distribution  or 
probability  for  the  parameter  or  event.  To 
effect  this  revision,  the  decision  maker  must 
assess  the  relative  expertise  of  the  experts 
and  their  interdependence,  both  with  each 
other  and  the  decision  maker.  The  experts' 
distributions  are  considered  as  data  by  the 
decision  maker,  which  is  used  to  update  the 
prior  distribution. 

For  automatic  document  retrieval,  the 
retrieval  system  is  the  decision  maker,  and 
different  retrieval  algorithms,  or  models,  are 
the  experts  (Thompson  1990a,b,  1991).  This 
is  referred  to  as  the  upper  level  CEO.  At 
the  lower  level  the  probabilities  of  individual 
features,  e.g.,  terms,  within  a  particular 
retrieval  model  can  be  combined  using  CEO. 
In  lower  level  CEO  the  retrieval  model  is 
the  decision  maker  and  the  term  probabilities 
are  viewed  as  lower  level  experts.  The 
probability  distributions  supplied  by  these 
lower  level  experts  can  be  updated, 
according  to  Bayes'  theorem,  by  user 
relevance  judgments  for  retrieved  documents. 
These  same  relevance  judgments  also  give 
the  system  a  way  to  evaluate  the 
performance  of  each  model,  both  in  the 
context  of  a  single  search  of  several 
iterations  and  over  all  searches  to  date. 
These  results  can  be  used  in  a  statistically 
sound  way  to  weight  the  contributions  of  the 
models  in  the  combined  probability 
distribution  used  to  rank  the  retrieved 
documents.  Since  various  algorithms,  such 
as  p-norm,  are  expressed  in  terms  of 
correlations  rather  than  probability 
distributions,  it  was  necessary  to  extend  the 
CEO  algorithm  to  handle  correlations.  So 
far  this  extension  has  been  handled  in  a 
heuristic  fashion.  If  a  retrieval  method,  e.g., 
one  of  the  cosine  methods,  returned  a  value 
between  0  and  1  as  a  retrieval  status  value, 
the  logistic  transformation  of  this  weight  was 
interpreted  as  an  estimate  of  the  mean  of  a 
logistically  transformed  beta  distribution 
which  was  provided  as  evidence  to  the 
decision  maker.  Since  there  was  no  basis 
with  which  to  assign  a  standard  deviation  to 
this  distribution,  as  called  for  by  the  CEO 
methodology,  an  assumption  was  made  that 
all  standard  deviations  were  .4045,  a  value 
corresponding  to  a  standard  deviation  of  .1 
in  terms  of  probabilities.  The  CEO  code 
was  written  in  g++. 
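
The heuristic treatment of retrieval status values can be illustrated with a short sketch. This is not the CEO code itself: the paper does not spell out the update rule, so the weighted mean on the logit scale used here is an assumption (a natural choice when every expert is given the same standard deviation), and the names `logit`, `inv_logit`, and `combine` are hypothetical. The weights stand in for the TREC-1 11 point averages of the methods.

```python
import math

def logit(p, eps=1e-6):
    """Logistic transformation of a retrieval status value in [0, 1]."""
    p = min(max(p, eps), 1.0 - eps)  # keep away from 0 and 1
    return math.log(p / (1.0 - p))

def inv_logit(x):
    """Map a value on the logit scale back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

def combine(rsvs, weights):
    """Combine expert scores for one document (assumed rule, see lead-in).

    rsvs: retrieval status values in [0, 1], one per retrieval method.
    weights: one non-negative weight per method, e.g. its TREC-1 score.
    """
    total = sum(weights)
    mean = sum(w * logit(p) for p, w in zip(rsvs, weights)) / total
    return inv_logit(mean)
```

Documents would then be ranked by the combined value returned for each of them.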

For  TREC-1  we  used  the  CEO  algorithm  to 
combine  all  of  the  VPI&SU  retrieval 
methods  except  for  the  Boolean,  i.e., 
weighted  and  unweighted  cosine  and  inner 
product  measures  as  well  as  p-norm 
measures  of  1.0,  1.5,  and  2.0.  For  measures, 
such  as  the  inner  product  and  some  of  the 
p-norm  results  not  giving  a  retrieval  status 
value  in  the  0  to  1  range,  the  result  was 
mapped  to  this  interval  by  scaling  the 
highest  score  of  the  method  in  question  for 
a  given  topic  to  the  highest  score  given  by 
one  of  the  cosine  measures.  Default  scores 
half  way  between  0  and  the  lowest  score 
achieved  by  a  particular  method  were  used 
for  documents  not  retrieved  in  the  top  200  in 
response  to  a  given  topic,  since  the  actual 
score  of  these  documents  was  unknown.  For 
TREC-2 we followed the same approach
except that only the results of methods with
better TREC-1 performance were combined.
Our first version used Cosine.atn and
Cosine.nnn, the two best VPI&SU methods
from TREC-1, weighted by their
performance on TREC-1. The second
version used these two methods and the next
best three, Inner.atn, Inner.nnn, and Pnorm
1.0, also weighted by their TREC-1
performance (see VPI&SU report for details
on these methods). Figures 1 and 2 show
our summary official TREC-2 results for
versions 1 and 2 respectively.
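
The score normalization described in this section (scaling methods whose scores fall outside the 0 to 1 range, and assigning a default score to documents that were not retrieved) can be sketched as follows. The function names and the dictionary layout are hypothetical; only the two rules themselves come from the text.

```python
def scale_to_reference(scores, reference_max):
    """Scale one method's scores for a topic so that its highest score
    equals the highest score given by a cosine measure for that topic.

    scores: dict mapping document id -> raw score for one method and topic.
    reference_max: highest cosine score for the same topic.
    """
    top = max(scores.values())
    if top <= 0:
        return dict(scores)  # nothing sensible to scale
    factor = reference_max / top
    return {doc: s * factor for doc, s in scores.items()}

def score_or_default(scores, doc_id):
    """Return the method's score for a document, or the default score:
    half way between 0 and the lowest score the method achieved."""
    if doc_id in scores:
        return scores[doc_id]
    return min(scores.values()) / 2.0
```

For example, a method with raw scores 40 and 10 and a cosine maximum of 0.8 would be scaled to 0.8 and 0.2, and an unretrieved document would receive 0.1.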


Queryid (Num):       all prceo1
Total number of documents over all queries
    Retrieved:    50000
    Relevant:     10785
    Rel_ret:       3561
Interpolated Recall - Precision Averages:
    at 0.00       0.4768
    at 0.10       0.2613
    at 0.20       0.1807
    at 0.30       0.1202
    at 0.40       0.0875
    at 0.50       0.0504
    at 0.60       0.0328
    at 0.70       0.0169
    at 0.80       0.0142
    at 0.90       0.0077
    at 1.00       0.0031
Average precision (non-interpolated) over all rel docs
                  0.0904
Precision:
    At    5 docs: 0.2120
    At   10 docs: 0.2480
    At   15 docs: 0.2707
    At   20 docs: 0.2760
    At   30 docs: 0.2753
    At  100 docs: 0.2418
    At  200 docs: 0.1756
    At  500 docs: 0.1086
    At 1000 docs: 0.0712
R-Precision (precision after R (= num_rel for a query) docs retrieved):
    Exact:        0.1703

Figure 1: Summary scores for PRC version
1 using two best experts


Queryid (Num):       all prceo1
Total number of documents over all queries
    Retrieved:    50000
    Relevant:     10785
    Rel_ret:       3323
Interpolated Recall - Precision Averages:
    at 0.00       0.5963
    at 0.10       0.3270
    at 0.20       0.2306
    at 0.30       0.1430
    at 0.40       0.0866
    at 0.50       0.0425
    at 0.60       0.0323
    at 0.70       0.0246
    at 0.80       0.0209
    at 0.90       0.0131
    at 1.00       0.0041
Average precision (non-interpolated) over all rel docs
                  0.1120
Precision:
    At    5 docs: 0.4120
    At   10 docs: 0.4000
    At   15 docs: 0.3920
    At   20 docs: 0.3740
    At   30 docs: 0.3527
    At  100 docs: 0.2722
    At  200 docs: 0.1974
    At  500 docs: 0.1090
    At 1000 docs: 0.0665
R-Precision (precision after R (= num_rel for a query) docs retrieved):
    Exact:        0.1809

Figure 2: Summary scores for PRC version
2 using five best experts
3.  Conclusion 

Although  selecting  the  best  individual 
TREC-1  VPI&SU  retrieval  methods  for 
combination  and  weighting  them  by  their 
TREC-1  performance  seemed  to  be  a 
reasonable  strategy  which  would  yield  better 
retrieval  results  than  unweighted 
combination  of  all  methods,  in  fact  CEO 
performance  was  worse  on  TREC-2.  This 
was  due,  in  part,  to  changes  made  in  the 
individual  methods  which  made  methods  that 
had  been  best  in  TREC-1  less  effective  than 
other  methods  for  TREC-2.  These  results 
suggest  that  selection  and  weighting  of 
methods  for  combination  based  on 
performance  of  earlier  versions  of  the 
methods  is  unwarranted. 
1.0  Introduction 


Systems Environments is a commercial category B participant
in TREC. For the past two years we have been developing
a software system, starting on a DOS-based PC and
recently moving to Windows NT on the same 486/25 PC.
Much development time has been diverted toward getting
around bugs and trying to fit the system on small (200 Mb)
disks and within the memory constraints of the DOS system.
At this time the system is still under development.


2.0  System  Architecture 


FORMS (Feedback, Object-oriented Retrieval Methods) is
an object-oriented, concept-based information retrieval
system.

FORMS is concept based because it creates a profile of the
user's information need and matches that profile to the
document profiles. Profiles can be stored in a library of
concepts which is retained from one use of the system to
another. Relevance feedback is an essential ingredient in
the design of FORMS, although so far it has not been used
in TREC.


FORMS is designed and implemented using the object-oriented
approach. Object-oriented methods were chosen to
create a system architecture that can be installed in a variety
of different environments using different architectures.
In its smallest implementation FORMS can function as a
personal information system on a single PC, or it can be
installed as an organization-wide system using a client-server
approach. The basic objects and their relationships are
shown in the following diagram and described in the
following sections.

2.1    SGML  Documents 

A document arrives and first must be translated into an
SGML (Standard Generalized Markup Language) document.


2.2    Document  Profiles 

Elements are the things found in text documents that capture
the meaning embedded in the documents. The basic
element in text is obviously the word. But there are other
elements that can be very helpful. Identifying certain kinds
of proper nouns can be crucial to determining the relevance
of text to an information need. Important proper
nouns might be persons, places, and some kinds of things,
e.g. companies.

There are two distinct types of elements used in FORMS,
simple and complex. A simple element consists of a character
string, e.g. a word, and an element type, e.g. a noun
or a noun phrase. Complex elements are various combinations
of simple elements. For example, a boolean 'and' of
several simple elements occurring in the same sentence,
paragraph or document would be a complex element.

In general simple elements are only used to create document
profile elements. A document profile element IS-A
simple element that includes the document id and the number
of occurrences of the element in the document.
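
The element types described in this section might be modelled along the following lines. This is a sketch only: FORMS itself is not written in Python, and every class and field name below is a hypothetical choice made to mirror the prose, including the IS-A relationship between document profile elements and simple elements.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class SimpleElement:
    text: str          # character string, e.g. a word
    element_type: str  # e.g. "noun" or "noun phrase"

@dataclass(frozen=True)
class ComplexElement:
    operator: str                     # e.g. a boolean "and"
    scope: str                        # "sentence", "paragraph" or "document"
    parts: Tuple[SimpleElement, ...]  # the simple elements being combined

@dataclass(frozen=True)
class DocumentProfileElement(SimpleElement):
    # IS-A simple element, extended with document-specific data.
    doc_id: str
    occurrences: int
```

A document profile is then simply a collection of `DocumentProfileElement` objects sharing one `doc_id`.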


2.3  Indexes 

After the document profiles are created, those elements
that occur in more than one document are stored in the
element index. A second index from the document ID to the
full text of the document is also created so that the user
can retrieve a document for inspection.
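
Building the element index can be sketched as below, assuming document profiles are available as plain dictionaries of element counts. The function name and data layout are hypothetical; the only rule taken from the text is that an element is indexed when it occurs in more than one document.

```python
from collections import defaultdict

def build_element_index(profiles):
    """Build an index from element to the documents containing it.

    profiles: dict mapping document id -> dict of element -> occurrence count.
    Returns a dict mapping element -> {document id: count}, restricted to
    elements that occur in more than one document.
    """
    postings = defaultdict(dict)
    for doc_id, counts in profiles.items():
        for element, count in counts.items():
            postings[element][doc_id] = count
    return {e: docs for e, docs in postings.items() if len(docs) > 1}
```

The second index, from document ID to full text, would be an ordinary lookup table keyed by document ID.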

3.0  User  Interface 

The user interface objects (grey area in the diagram)
provide objects to display documents and to retrieve documents
using a natural language query as well as stored concepts,
together with a window that displays results.

3.1  Document  Display 

The document display window permits a user to enter a
document ID, and the system will simply retrieve the document
from one of the text files and display it on the screen.
When a document is displayed, the user has the option to
indicate that the document is or is not relevant to the current
query (or concept). When an indication of relevance is
given, the system modifies the concept/query profile and
provides a new ranked list of retrieved documents.

3.2  Query  Window 

The query window actually consists of several parts: a natural
language part for entering a query, and a query profile
part that displays the elements of the query/concept along
with various statistics about each element. Most important
is the probability that a document is relevant if it contains
a particular element. Each element is treated as being
independent of every other element.

3.3    Results  Window 

The  results  window  contains  the  list  of  all  documents  with  a 
probability  of  relevance  to  the  current  query/concept.  Documents 
are  ranked  by  probability  of  relevance.  For  each  document,  the 
elements  found  in  the  document  are  listed  as  well  as  any  previous 
relevance  information.  The  document  ID  selected  from  the 
results  window  can  be  used  in  the  document  window  to  examine 
the  document. 

4.0  Basic  Assumptions 

The approach being taken by FORMS is based on a number
of critical assumptions.

•  Using  the  axioms  of  probabilities  as  a  foundation  for 
determining  the  relevance  of  documents  to  a  query 

•  Elements  besides  words,  such  as  phrases,  can  be  found 
to  make  a  significant  contribution  to  retrieval  accuracy 

•  A number of different approaches to identifying relevant
elements are worth pursuing. These include
proper noun identification, part of speech tagging,
noun phrase tagging, and as yet undetermined relations
that can be extracted from natural language.

5.0  Progress  to  Date  

Frankly, there has been little progress to date beyond basic
system development. In both TREC I and II, we have performed
routing queries with very poor results. One obvious
reason is that we have had to perform the analysis by
breaking the texts into small sections, processing even the
category B data in sections.

6.0  Future  Work 


FORMS is designed to provide information on the effectiveness
of different approaches.

•  To examine different approaches to incorporating relevance
feedback into evaluating relevance of other documents.

•  To examine whether queries can be analyzed and classified
so that the system can determine which approach
is most likely to be successful for a particular query or
concept.

•  To examine whether certain kinds of elements (where
an element is a word, a phrase, a proper noun, a verb, a
cooccurrence etc.) can be predetermined to be helpful
in certain types of queries.

7.0  Results  &  Conclusion 


At this time no results are supported by work performed
with FORMS. However, it is worth emphasizing that the
results in TREC seem to support the view that there is no
specific approach that is going to revolutionize information
retrieval. Rather, it seems that improvements are
going to come from attention to details and finding the
right element to use at the right time.
UCLA-Okapi  at  TREC-2:  Query  Expansion  Experiments 

Efthimis  N.  Efthimiadis*  and  Paul  V.  Biron 

Graduate  School  of  Library  and  Information  Science 
University  of  California  at  Los  Angeles 


1  Introduction 

This is the first participation of the Graduate School of Library
and Information Science, University of California at
Los Angeles, in the TREC Conference. For TREC-2, Category
B, UCLA used a version of the Okapi text retrieval
system that was made available to UCLA by City University,
London, UK. Okapi has been described in TREC-1
(Robertson, Walker, Hancock-Beaulieu, Gull & Lau,
1993a) as well as in this conference (Robertson, Walker,
Jones, Hancock-Beaulieu, & Gatford, 1994). Okapi is a
simple set-oriented system based on a generalized probabilistic
model with facilities for relevance feedback. In addition,
Okapi supports a full range of deterministic Boolean
and quasi-Boolean operations.

1.1  Objectives 

The  main  research  objective  of  the  UCLA  participation 
in  TREC-2  was  to  investigate  query  expansion  within  the 
framework  as  provided  by  Okapi.  More  specifically,  the 
objectives  were  to: 

•  use an enhanced version of the Go-See-List (GSL) and
evaluate its effect on retrieval performance.

•  investigate the performance of query expansion with
and without relevance information by varying the
number of documents that are treated as relevant and
the number of terms that are included in the expansion.

•  compare the performance of different ranking algorithms
for the ranking of terms for term selection during
query expansion.

•  compare the effectiveness in retrieval of user assigned
relevance judgements against hypothetically assumed
relevance judgements based on the top X documents.

*To whom all correspondence should be addressed. Graduate
School of Library and Information Science, University of
California at Los Angeles, 405 Hilgard Avenue, Los Angeles,
CA 90024-1520, e-mail: iacxene@mvs.oac.ucla.edu


1.2    The  Okapi  version  at  UCLA  and  the 
WSJ  database 

The Okapi system consists of a low level search engine or
basic search system (BSS), a user interface for the manual
search experiments, and data conversion and inversion
utilities.

The UCLA hardware consisted of a Sun SPARC-2 machine
with 32 MB of memory and 1 GB of disk storage.

The  Wall  Street  Journal  (WSJ)  database  was  used  for 
both  the  routing  and  ad-hoc  searches.  Because  of  the  lack 
of  adequate  disk  space  on  the  UCLA  machine  the  database 
was  indexed  at  City  University  by  Stephen  Walker  and  it 
was  then  transferred  (FTP-ed)  to  UCLA. 

For TREC-2 the Okapi databases were built by indexing
mainly the DOCNO and TEXT fields of the records.
Inverted indexes included complete within-document positional
information, enabling term frequency and term proximity
to be used. Okapi's typical index size overhead is
around 80% of the textfile size. The elapsed time for inversion
of the WSJ database was about 12 hours.

At this point it is worth noting (a) the nature of the
WSJ records, and (b) a limitation of Okapi due to indexing.

(a) The WSJ records consist of documents that do not
have the same kind of structure found in bibliographic
databases, such as INSPEC or ERIC. The records contain
the full text of stories and have varied length, mostly longer
than the length of an average abstract of a bibliographic
database. In addition, the language and the style is mostly
'journalistic' as opposed to 'scientific', i.e. less structured.
One important issue is that some WSJ records contain
short multi-story articles which are completely unrelated
to one another. This type of record is usually
a compilation of a number of one- or two-paragraph long
news stories. The stories share no content relation between
them; the only common feature is their co-existence in the
same record. This has implications for retrieval effectiveness,
especially when such records are included in the pool
of documents that provide terms for query expansion,
because of the noise introduced by the terms taken from the
irrelevant stories.

(b)  This  last  issue  relates  to  a  limitation  of  Okapi.  The 
version  of  Okapi  used  at  UCLA  retrieves  documents  at  the 
record  level  only.  Retrieval  at  the  paragraph  level,  which 
would  have  facilitated  a  better  handling  of  some  issues  like 
the  above,  is  not  currently  available. 


2    The  weighting  functions 

The  weighting  of  search  terms  can  be  said  to  involve  two 
levels: 

level  1:  A  weighting  function  is  used  to  weight  the  terms 
for  the  initial  query  as  well  as  the  terms  for  subsequent 
search  iterations  of  the  same  query  or  some  modified 
version  of  the  query. 

level  2:  A  weighting  function  is  used  for  the  weighting  of 
candidate  terms  for  query  expansion. 

Sections  2.1  and  2.2  discuss  functions  used  in  level  1  and 
level  2  respectively. 

2.1  Search term weighting

The theory of relevance weights (Robertson & Sparck
Jones, 1976) provides the basic probabilistic model. The
binary independence or relevance weight model assigns a
weight to each term, and the matching function for each
document is given by the 'simple sum-of-weights' over all
of the terms in the query.

The weight of a term is calculated by the following function,
which is also known as the f4 point-5 formula:

$$w_{f4} = \log \frac{(r + .5)(N - n - R + r + .5)}{(n - r + .5)(R - r + .5)} \tag{1}$$

where,

N is the total number of documents in the collection;

R is the number of documents in the sample of relevant
documents as defined by the user's feedback;

n is the number of documents indexed by term t;

r is the number of relevant documents (from the
sample R) assigned to term t.

When relevance information is not available the above
weight reduces to approximately the inverse document
frequency (IDF).

For calculating the total weight of a document the following
function was used, which is based on the binary independence
model and takes into consideration the 2-Poisson
model for within-document frequency (tf) and the document
length. These are described in detail in Robertson et
al. (1993b). The purpose of the UCLA Okapi system was
to evaluate the existing Okapi models, and it therefore did
not allow for modifications of the existing functions. For
compatibility purposes and for comparisons it was decided
to use the BM15 (best match) function for the runs. The
BM15 best match weighting function is:

$$docweight_{bm15} = \sum \left( \frac{tf}{k_1 + tf} \times w_{f4} \right) + k_2 \times nq \times \frac{(avedl - dl)}{(avedl + dl)} \tag{2}$$

where k_1 and k_2 are unknown constants. In the UCLA-
Okapi implementation the values for these constants are:
k_1 = 1 and k_2 = 1.

2.2  Query expansion term weighting

The ranking algorithms that were considered for the ranking
of terms for query expansion were: wpq, emim, porter,
r_lohi and r_hilo. These algorithms are described briefly
below.


2.2.1    The  wpq  algorithm 

This algorithm is based on an independence assumption
that holds between a query expansion term and the terms in
the entire previous search formulation (Robertson, 1990).
According to the relevance weighting theory, the inclusion
of term t in the search formulation with weight w_t will
increase the effectiveness of retrieval by

$$wpq = w_t (p_t - q_t) \tag{3}$$

where w_t is a weighting function, which in this case is the
w_f4; p_t is the probability of term t occurring in a relevant
document; and q_t is the probability of term t occurring
in a non-relevant document.

This means that, irrespective of the weighting function
(w_t) used, the rule for deciding the inclusion of a term in a
query expansion search should be based on the ranking of
wpq instead of w_t alone. Substituting the weighting function
and the probability of relevance in wpq with r, R, n,
N we get:

$$wpq = \log \frac{(r + .5)(N - n - R + r + .5)}{(n - r + .5)(R - r + .5)} \times \left( \frac{r}{R} - \frac{n - r}{N - R} \right) \tag{4}$$
The wpq algorithm combines the effects of the relevance
weighting theory, as expressed by the w_f4 component,
which assigns greater importance to infrequent terms,
with the frequency of occurrence of a term in the relevant
document set.
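
The f4 and wpq weights are straightforward to compute from the four counts r, R, n and N. The sketch below is an illustration rather than the Okapi code: the function names and the `stats` layout are hypothetical, and wpq as written requires R greater than 0 and smaller than N.

```python
import math

def f4_weight(r, R, n, N):
    """Relevance weight of a term (the f4 point-5 formula)."""
    return math.log(((r + 0.5) * (N - n - R + r + 0.5)) /
                    ((n - r + 0.5) * (R - r + 0.5)))

def wpq(r, R, n, N):
    """f4 weight multiplied by (p_t - q_t), with p_t = r/R and
    q_t = (n - r)/(N - R). Requires 0 < R < N."""
    return f4_weight(r, R, n, N) * (r / R - (n - r) / (N - R))

def rank_candidates(stats, R, N, top_k=10):
    """Rank candidate expansion terms by wpq, highest first.

    stats: dict mapping term -> (r, n) counts for that term.
    """
    scored = {t: wpq(r, R, n, N) for t, (r, n) in stats.items()}
    return sorted(scored, key=lambda t: scored[t], reverse=True)[:top_k]
```

With r = 0 and R = 0 the f4 weight becomes log((N - n + .5)/(n + .5)), which is the IDF-like form used when no relevance information is available.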


2.2.2    The  emim  algorithm 

The  expected  mutual  information  measure  {emim)  is  a 
term  weighting  model  incorporating  relevance  information 
in  which  it  is  assumed  that  index  terms  may  not  be  dis- 
tributed independently  of  each  other,  (van  Rijsbergen, 
1977;  Harper  and  van  Rijsbergen,  1978;  van  Rijsbergen, 
Harper  &  Porter,  1981) 

The  emim  weight  reduces  to  the  f4  weight  when  the  "de- 
gree of  involvement",  i.e.  the  joint  probabilities,  are  all 
unity.  Assuming  the  same  definitions  for  n,  N,  r,  R,  as 
those  already  used  earlier,  the  emim  weight  of  a  term  is 
calculated  as  follows: 

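$$E_{iq} = P_{11}i_{11} - P_{12}i_{12} - P_{21}i_{21} + P_{22}i_{22}$$

$$E_{iq} = r \log\frac{rN}{nR} - (n - r)\log\frac{(n - r)N}{n(N - R)} - (R - r)\log\frac{(R - r)N}{(N - n)R} + (N - n - R + r)\log\frac{(N - n - R + r)N}{(N - n)(N - R)}$$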

2.2.3    The porter algorithm

Porter and Galpin (1988) describe a ranking formula used
in the MUSCAT online catalogue:

$$\mathrm{porter} = \frac{r}{R} - \frac{n}{N} \qquad (5)$$

where r, R, n, N are defined as in the f4 weight (eq. 1).


2.2.4    The r_lohi algorithm

The r_lohi algorithm has been proposed by Efthimiadis
(1993a) as the result of the observation of the ranking behavior
of six algorithms used for ranking terms for query
expansion.

The r_lohi ranking algorithm:

•  ranks terms according to r, i.e. their frequency of occurrence
in the relevant document set, in descending
order, and

•  resolves ties according to their term frequency, n, from
low-to-high frequency.

It was hypothesized that the r_lohi algorithm would have
an almost identical ranking to porter and a performance approaching
that of wpq and emim. More differences between
the algorithms may occur if the size of the set of relevant
documents (R) gets larger. Conclusions about the algorithm,
however, could not be drawn before it was evaluated
against the other algorithms. The results of that evaluation
are reported in Efthimiadis (in press), where the r_lohi algorithm
demonstrated better performance when compared
to the other algorithms.


2.2.5    The r_hilo sort

A variant of the r_lohi algorithm is to rank candidate terms
for query expansion using the r_hilo rank which:

•  ranks terms according to r, i.e. their frequency of occurrence
in the relevant document set, and

•  resolves ties according to their term frequency, n, from
high-to-low frequency.

Since the r_hilo algorithm will result in sorting terms in
exactly the opposite way of the r_lohi algorithm it was included
as a control for the study.


3  Methodology


3.1  Runs 

Initial tests were performed on topics 1-50, where the dependent
variables were the weighting function and the query
processing of terms. From the results obtained it was established
that the function to use would be BM15 and that
the parsing of the Topics would include both single terms
and "phrases" as defined by comma delimited text in the
Topics.

The table below (Table 3.1) gives all the variables used
in constructing the runs. The options available for each
variable are also provided.

Table 3.1  Methodology for the Routing Runs on Topics 51-100

                      Query                   Number of   No. of Docs
Weighting             Expansion               Terms       used for       UCLA
Function    Phrases   (QE)        Algorithm   Expanded    Auto Rel Fbk   GSL

bm15        no        no          wpq         0           0              no
            yes       yes         emim        10          5              yes
            both                  porter      20          10
                                  r_lohi      30          15
                                  r_hilo                  20

Weighting Function:  Best match function BM15 (see
equation 2).

Phrases:  Choice of YES, NO, or BOTH. This determines
the type of parsing of the "Concepts" and "Title" fields
from the Topics, which were the source of the search
terms. NO means that the terms extracted from the
Concepts and Title fields are single terms only. YES
means that phrases get extracted as determined by the
simple routine, where a phrase is identified by using
the punctuation found in the Concepts and Title fields.
BOTH is the combination of the two methods and the
terms are searched as single terms as well as phrases.

Query Expansion (QE):  The choice of query expansion
algorithms is one of wpq, emim, porter, r_lohi, r_hilo.

Terms expanded:  This specifies the number of terms to
include in the expansion. When the number of terms
expanded is zero, then only the initial query is run.

Feedback documents:  This defines the number of top
ranked documents to be treated as relevant and to
provide the source for the terms for query expansion.

UCLA GSL:  defines whether the standard Okapi GSL or
the UCLA enhanced version of the GSL will be used.

Because of the many parameters involved in each run,
the names of runs have been deliberately made explicit,
which, however, resulted in rather long names. For example,
bm15.phb.qey:r_lohi-10-5.uclagsly means that for
this run the weighting function used was BM15, phrases
were set to BOTH, query expansion took place, the r_lohi
algorithm was used for the ranking of terms for query expansion,
10 terms were added in the expansion, 5 documents
provided the source of the terms for the expansion,
and the UCLA enhanced GSL was also used.
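The naming convention can be unpacked mechanically. The sketch below is only an illustration: the field layout is inferred from the single example above and from the run names listed in Table 1, and the function name and the dictionary keys are hypothetical.

```python
import re
from typing import Dict, Optional

# weighting . ph{y|n|b} . qe{y|n}[:algorithm-terms-docs] . uclagsl{y|n}
RUN_NAME = re.compile(
    r"^(?P<weighting>bm\d+)"
    r"\.ph(?P<phrases>[ynb])"
    r"\.qe(?P<qe>[yn])"
    r"(?::(?P<algorithm>[a-z_]+)-(?P<terms>\d+)-(?P<docs>\d+))?"
    r"\.uclagsl(?P<gsl>[yn])$"
)

PHRASES = {"y": "YES", "n": "NO", "b": "BOTH"}


def parse_run_name(name: str) -> Optional[Dict[str, object]]:
    """Split a run name into its parameters; None if it does not match."""
    match = RUN_NAME.match(name)
    if match is None:
        return None
    fields = match.groupdict()
    return {
        "weighting": fields["weighting"],
        "phrases": PHRASES[fields["phrases"]],
        "query_expansion": fields["qe"] == "y",
        "algorithm": fields["algorithm"],            # None without expansion
        "terms": int(fields["terms"] or 0),          # terms added
        "docs": int(fields["docs"] or 0),            # feedback documents
        "ucla_gsl": fields["gsl"] == "y",
    }
```

For the example run name, the parser returns BM15 as the weighting function, BOTH for phrases, r_lohi as the algorithm, 10 terms, 5 documents and the UCLA GSL switched on.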


3.2  Go-See-List 

The Go-See-List (GSL) is a look-up table that contains
stopwords, semi-stopwords, prefixes, go-phrases and synonym
classes. The GSL is used during the indexing of a
database as well as during searching.

The stopwords are terms that are thought
to have little or no value for retrieval. These include
contractions, prepositions, adverbs, etc.


The  semi-stopwords  are  terms  that  are  thought  to  have 
low  value  for  retrieval  purposes.  Therefore,  a  semi- 
stopword  will  be  searched  only  during  the  initial  search  if 
it  has  been  part  of  the  user's  search  statement.  If,  however, 
the  term  has  emerged  as  the  result  of  a  query  expansion  it 
is  stopped,  i.e.  excluded  from  the  pool  of  candidate  terms 
for  query  expansion. 

Go-phrases  are  mostly  noun-phrases  that  need  to  be 
searched  as  one  word  or  else  the  precision  will  be  very  low, 
e.g.  New  York.  GSL  contains  a  small  number  of  selected 
go-phrases. 

Synonym entries contain a mix of terms/concepts that
are treated as synonyms for retrieval purposes. These may
be true synonyms, quasi-synonyms, or semantically unrelated
terms which are grouped together because of some
common properties which have value for retrieval. Finally,
the synonym entries also contain term variants that are
known to "escape" from the conflation algorithm. The
structure of the UCLA GSL is given in the table below.


The Go-See-List (GSL)

                  City   added by UCLA   UCLA total
stopwords          411              72          483
semi-stopwords      58                           58
prefixes            18                           18
Go-phrases          43              84          127
Synonyms           359             604          963

For the UCLA GSL, the Titles and Concepts of Topics
1-100 were analyzed and synonym classes were generated
from the data. The list includes 40 personal names
and 250 synonym classes. In addition, a list of organizations
and a list of common business acronyms and abbreviations
were compiled.
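The different treatment of stopwords and semi-stopwords described in this section can be sketched as follows. This is a simplified illustration, not the Okapi data structure: the class and method names are hypothetical, prefixes and go-phrases are left out, and synonym classes are reduced to a plain mapping from a variant to a class label.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Set


@dataclass
class GoSeeList:
    stopwords: Set[str] = field(default_factory=set)
    semi_stopwords: Set[str] = field(default_factory=set)
    synonyms: Dict[str, str] = field(default_factory=dict)  # variant -> class

    def normalise(self, term: str) -> str:
        """Map a term to its synonym class, if it has one."""
        return self.synonyms.get(term, term)

    def initial_query_terms(self, user_terms: Iterable[str]) -> List[str]:
        """Stopwords are dropped; semi-stopwords typed by the user are kept."""
        return [self.normalise(t) for t in user_terms
                if t not in self.stopwords]

    def expansion_candidates(self, pool: Iterable[str]) -> List[str]:
        """Stopwords and semi-stopwords are both excluded from expansion."""
        return [self.normalise(t) for t in pool
                if t not in self.stopwords and t not in self.semi_stopwords]
```

With "report" as a semi-stopword, the term survives in a user's initial query but is removed from the pool of candidate expansion terms.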

3.3    Query  term  selection 

Query terms were selected from the Title and Concepts
fields of the records. The processing of these fields was
very simple. Programs written in awk and perl were used
to isolate the required fields, which were then parsed and
the resulting terms stemmed in accordance with the indexing
procedures followed for building the WSJ database.
This process resulted in one-word query terms. When appropriate,
the procedure also output phrases by treating the
punctuation available in these fields as the phrase delimiter.

Queries were then generated automatically from the Title
and Concepts fields. Exactly the same queries were used
in the Routing and Ad hoc searches.
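A minimal sketch of this parsing step is given below, written in Python rather than the awk and perl actually used. It assumes that commas and semicolons are the only phrase delimiters and it omits stemming; the function name and the example field are invented for illustration.

```python
import re
from typing import List


def _unique(items: List[str]) -> List[str]:
    """Remove duplicates while keeping the first occurrence order."""
    seen = set()
    kept = []
    for item in items:
        if item not in seen:
            seen.add(item)
            kept.append(item)
    return kept


def parse_field(text: str, phrases: str = "BOTH") -> List[str]:
    """Extract query terms from a Title or Concepts field.

    phrases: "NO" for single terms, "YES" for punctuation-delimited
    units, "BOTH" for single terms plus multi-word phrases.
    """
    segments = [" ".join(part.split()).lower()
                for part in re.split(r"[,;]", text)]
    segments = [s for s in segments if s]
    singles = [word for s in segments for word in s.split()]
    if phrases == "NO":
        return _unique(singles)
    if phrases == "YES":
        return _unique(segments)
    if phrases == "BOTH":
        return _unique(singles + [s for s in segments if " " in s])
    raise ValueError("phrases must be YES, NO or BOTH")
```

For a field such as "Antitrust, price fixing, monopoly", the NO setting yields four single terms, while BOTH adds the phrase "price fixing" to them.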

3.4  Term  selection  for  query  expansion 

a)  Routing searches:  Query expansion in the routing
searches was performed through query modification
without relevance information. As indicated in the table
that describes the construction of the runs in the
methodology section, the number of documents used
ranged from 0 to 20 of the top ranked documents, in increments
of 5 documents. These top ranked documents
were treated as relevant and were analyzed in order
to provide terms for the expansion. Expansion terms
were selected by pooling all the terms and then weighting
these terms with one of the five ranking algorithms
as specified by the run. Then the top 10, 20 or 30 terms
were added to the original query terms and searched.

b)  Ad hoc searches:  The term pool consisted of all the
terms of the documents judged as relevant. For the Ad
hoc searches with feedback of the official results, the
top 10 terms as determined by wpq were chosen for
expansion and were searched together with the initial
query terms.

c)  Rules for term selection:  The following rules were
followed for the inclusion or exclusion of a term during
selection for query expansion:

a)  numbers were excluded as terms,

b)  all terms whose frequency (n) is less than or equal to the
number of relevant documents seen (R), i.e., if n <= R,
were excluded.
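The pooling, filtering and ranking steps can be sketched as follows. The sketch is an illustration under stated assumptions: documents are given as lists of already stemmed terms, collection frequencies come from a hypothetical `doc_freq` mapping, only the r_lohi, r_hilo and porter rankings are shown, and the term itself is used as a final tie-break purely to make the output deterministic.

```python
from collections import Counter
from typing import Dict, List, Sequence


def select_expansion_terms(feedback_docs: Sequence[Sequence[str]],
                           doc_freq: Dict[str, int],
                           N: int,
                           algorithm: str = "r_lohi",
                           k: int = 10) -> List[str]:
    """Pool the terms of the feedback documents and return the top k."""
    R = len(feedback_docs)
    if R == 0:
        return []
    # r: number of feedback documents in which each term occurs
    r = Counter(term for doc in feedback_docs for term in set(doc))
    # rule a) drop numbers; rule b) drop terms with n <= R
    pool = [t for t in r if not t.isdigit() and doc_freq[t] > R]
    if algorithm == "r_lohi":       # r descending, ties by n low-to-high
        key = lambda t: (-r[t], doc_freq[t], t)
    elif algorithm == "r_hilo":     # r descending, ties by n high-to-low
        key = lambda t: (-r[t], -doc_freq[t], t)
    elif algorithm == "porter":     # r/R - n/N, descending
        key = lambda t: (-(r[t] / R - doc_freq[t] / N), t)
    else:
        raise ValueError(f"unknown algorithm: {algorithm}")
    return sorted(pool, key=key)[:k]
```

The two sorts differ only in how ties on r are broken, so they return the same terms in a different order whenever several candidates occur in the same number of feedback documents.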

3.5  Search  procedure 

All searches, Routing and Ad hoc, were automatic and determined
by the specifications made for each run. There
were no manual searches.

3.5.1    Ad hoc searches and searchers

There were no manual searches. For the Ad hoc searches
with relevance feedback, i.e. uclaf1 (official results), relevance
assessments were provided by two searchers. The
odd numbered topics were assessed by one searcher and
the even numbered topics by the other.


3.5.2  Relevance  assessments 

During  the  Ad  hoc  searches,  the  guidelines  for  relevance 
judgements  were: 

a)  review the entire document when judging relevance,
even if it seems to be peripheral or not relevant, because
many of the articles were found to
be collections of brief news stories, with the relevant
part of the text hidden in the middle or at the end of
the text.

b)  target 10 relevant documents; stop as soon as 10 are
found or at the 20th document. However, if 3 relevant
documents have not been found by then, continue until
3 are found (this is because OKAPI will not do an
expansion if it has fewer than 3 documents).
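The stopping rule in guideline b) can be written as a small loop. This is only a sketch of the rule as stated above; the function name, the parameter names and the representation of judgements as booleans in rank order are assumptions.

```python
from typing import Iterable, Tuple


def review_until_stop(judgements: Iterable[bool],
                      target: int = 10,
                      cutoff: int = 20,
                      minimum: int = 3) -> Tuple[int, int]:
    """Return (documents reviewed, relevant documents found).

    judgements: relevance decisions in rank order, True for relevant.
    """
    reviewed = 0
    relevant = 0
    for is_relevant in judgements:
        reviewed += 1
        if is_relevant:
            relevant += 1
        if relevant >= target:
            break                      # enough relevant documents found
        if reviewed >= cutoff and relevant >= minimum:
            break                      # cutoff reached with a usable set
    return reviewed, relevant
```

If only two relevant documents appear in the first 20, the loop keeps going past the cutoff until a third one is found or the ranked list is exhausted.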

3.5.3  Ad  hoc  additional  runs 

Following the TREC conference, a set of runs was conducted
on the Ad hoc queries in order to complete the evaluation
of the five ranking algorithms for query expansion
that were studied.

The relevance judgements made in the Ad hoc run
uclaf1 (fdbk.bm15.phb.qey:wpq-10-10.uclagsly) were extracted
and used in the subsequent runs. The process followed
in these additional runs is described below:

•  Four new Ad hoc runs were done, one for each of the
remaining algorithms which were used for the ranking
of terms for query expansion, i.e., emim, porter, r_hilo,
r_lohi.

•  The same initial query, which was generated automatically,
was used for all searches.

•  The relevance judgements made in the initially retrieved
set of the official Ad hoc run were extracted
and then simulated in the additional runs.

•  Query expansion terms were ranked using the algorithm
that was designated by each run. The 10 top
ranked terms from the pool were added to the query.

3.6    Problems  &  Limitations 

Lack of equipment has been a major problem in our participation.
To enable us to participate in TREC, SUN Microsystems
provided an equipment grant (SUN Sparc-2)
in March; however, no disk was initially available, and a
1-Gigabyte disk was acquired only in June. Consequently, only
the Ad hoc runs were included in the official results.

A limitation of the UCLA version of OKAPI is that it
does not allow modifications of the basic retrieval functions
(i.e., the BMs or best match functions).


4    Results  and  Discussion 

The  results  of  the  Routing  runs,  the  Ad  hoc  runs  and  the 
Ad  hoc  additional  runs  are  given  in  Table  1,  Table  2  and 
Table  3  respectively. 

Routing  runs 

The 35 Routing runs given in Table 1 are presented in
descending order of recall. The runs bm15.ph[ynb].qen.uclagsl[yn],
i.e., the runs without query expansion, were used as baseline
runs in order to facilitate comparisons. All other runs
reported in the table include query expansion.

The results indicate that runs with query expansion
where the r_lohi or the r_hilo algorithm was used performed
better than all other runs in terms of Recall, Average Precision,
and R-Precision.

Ad  hoc  runs 

Of the three official Ad hoc runs, uclaa1 was the automatic
run that did not include query expansion and has
been used as a baseline run; uclaa2 was an automatic run
that included query expansion without any relevance information;
and uclaf1 was a run with user supplied relevance
feedback and query expansion.

In terms of R-Precision and Average Precision the run
with feedback and query expansion (uclaf1) did better
than the automatic run with query expansion (uclaa2),
but the baseline (uclaa1) was slightly better than both.

Ad  hoc  additional  runs 

The results of the Ad hoc additional runs are given in Table
3. The official run with feedback (uclaf1) using wpq
for the expansion is compared to the runs which used the
r_lohi, r_hilo, emim and porter algorithms respectively for
the expansion. The results indicate that r_lohi and r_hilo
performed better than the other algorithms. These
results further corroborate the results obtained from the
routing runs.


In order to further validate the results, the sign test as
well as the t-test were performed on the data. The results
from the sign test are given in Tables 4-15. The tables
are arranged in sequence starting from Precision at 15, 30,
and 100 documents, then Average Precision, Recall, and
Recall-Precision. In each case, two tables are given; the first table
gives the differences and the second the probabilities.
As can be expected, there are no differences at Precision
at 5 documents and at Precision at 10 documents because
these were the same for all five runs. For this reason the
corresponding pairs of tables have not been included in the
paper. The results also show no significant differences at
Precision at 15 documents and at 30 documents. Significant
results appear at Precision at 100 documents, where
r_lohi ≈ r_hilo > emim ≈ wpq ≈ porter.

The sign test results on Average Precision demonstrate
that r_lohi ≈ r_hilo > wpq ≈ emim ≈ porter, where
emim > porter. The results on Recall show some grouping
between the algorithms, so that r_lohi ≈ r_hilo >
emim ≈ wpq > porter. The results from the Recall-Precision
indicate that r_lohi ≈ r_hilo ≈ emim > wpq ≈
porter, with r_lohi > emim but not significantly better and
with wpq slightly better than porter.

From the study of the sign test results certain overall
comments emerge about the performance of the five algorithms.
The results seem to be consistent throughout, with
r_lohi performing better than the other algorithms. Differences
between emim, wpq and porter are not consistent,
but it seems that emim is slightly better than wpq, which is
better than porter.
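For readers who want to check a probability against a pair of win counts, a two-sided sign test can be sketched as below. The procedure used in the study is not described, so the details are assumptions: an exact binomial probability for small samples, and a normal approximation with continuity correction when more than `exact_max_n` topics differ (the threshold of 25 is a guess). The function name is hypothetical.

```python
import math


def sign_test_p(wins_a: int, wins_b: int, exact_max_n: int = 25) -> float:
    """Two-sided sign test probability for paired win counts (ties dropped)."""
    n = wins_a + wins_b
    if n == 0:
        return 1.0
    if n <= exact_max_n:
        # exact binomial tail with p = 0.5, doubled for a two-sided test
        k = min(wins_a, wins_b)
        tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
        return min(1.0, 2 * tail)
    # normal approximation with continuity correction
    z = max(0.0, (abs(wins_a - wins_b) - 1) / math.sqrt(n))
    return math.erfc(z / math.sqrt(2))
```

With these assumptions, 5 wins against 3 gives a probability of about 0.7266 and 12 wins against 18 gives about 0.361, in line with the corresponding entries for Precision at 15 and at 30 documents.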

To further strengthen the validity of the results the t-test
was performed on the data. The t-test results are
given in Tables 16-21. The tables are arranged in sequence
from Precision at 15, 30 and 100 documents, Average Precision,
Recall, to Recall-Precision. Each table gives the
mean difference, the standard deviation of the differences, the
t-statistic and the probability. As in the case of the sign
test, there were no differences for Precision at 5 documents
and Precision at 10 documents and therefore the corresponding
tables have not been included in the paper. Similarly,
there are no significant differences at Precision at 15
documents and Precision at 30 documents. The results at
Precision at 100 documents show that r_lohi ≈ r_hilo >
emim ≈ wpq ≈ porter; this result is the same as that of the
sign test. The results from Average Precision demonstrate
that r_lohi ≈ r_hilo > emim ≈ wpq ≈ porter, with
emim better than porter. For Recall the results are that
r_lohi ≈ r_hilo > emim ≈ wpq > porter. Finally,
the Recall-Precision results demonstrate that r_lohi ≈
r_hilo ≈ emim > wpq > porter, where r_hilo is better
than emim.

The results of the t-tests are consistent for the algorithms
and corroborate the results obtained from the sign tests.
The two tests indicate that r_lohi and r_hilo have performed
consistently better than the other algorithms.
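The t-statistic reported for each pair of runs is consistent with a paired t-test over the per-topic scores. A minimal sketch follows; it assumes one score per topic for each run and leaves out the probability, which needs the t distribution with n-1 degrees of freedom. The function names are illustrative.

```python
import math
import statistics
from typing import Sequence, Tuple


def t_from_summary(mean_diff: float, sd_diff: float, n: int) -> float:
    """t-statistic from the mean and standard deviation of n differences."""
    return mean_diff / (sd_diff / math.sqrt(n))


def paired_t(scores_a: Sequence[float],
             scores_b: Sequence[float]) -> Tuple[float, float, float]:
    """Return (mean difference, SD of differences, t) for paired scores."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.fmean(diffs)
    sd_diff = statistics.stdev(diffs)   # sample SD; needs at least 2 pairs
    if sd_diff == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean_diff, sd_diff, t_from_summary(mean_diff, sd_diff, len(diffs))
```

As a check, a mean difference of 0.0363 with a standard deviation of 0.0743 over 50 topics gives t of roughly 3.45, close to the value reported for r_hilo against emim on Recall; the small gap comes from rounding in the published mean and standard deviation.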


5  Conclusions 

•  The results obtained from the use of the standard and
enhanced versions of the GSL indicate that further research
is needed in order to determine the effectiveness
of the GSL-synonym list in retrieval.

•  The combination of adding 10 terms from the 5 or 10
top ranked documents contributed to better retrieval
performance.

The other term/document combinations, i.e. adding
20, 30, or 40 terms from 15 or 20 documents, etc., had
a negative effect on retrieval performance.

•  The results from the routing searches indicate that
query expansion (i.e., feedback searches without relevance
information, where X number of terms is extracted
from Y number of top ranked documents that
are treated as relevant to the query) improved retrieval
performance depending on the algorithm used.

•  The r_lohi algorithm (Efthimiadis, 1993a) improved
retrieval performance in the routing runs when compared
to the initial (baseline) search which did not
involve either a feedback search or query expansion.

•  In the Ad hoc searches the results of the evaluation
of the five ranking algorithms indicate that r_lohi performed
better than the other algorithms. These results
were further validated by the results obtained from the
sign test and the t-test.

•  Although  query  expansion  seems  to  work,  the  retrieval 
performance  achieved  was  less  than  expected. 

There are many reasons that account for these results;
they are briefly addressed below.

1.  Completeness of the TREC Queries:

The major factor to which these results are attributed
is that the queries, i.e. TREC Topic Descriptions,
are almost complete, i.e. contain all
the important words required for the search.
Query expansion is the process of supplementing
the original query terms and is particularly effective
when incomplete queries are available.
Query expansion on these rather complete queries
seems to have had a small or even
a detrimental effect on overall retrieval performance.


2.  Size  of  the  TREC  collection: 

The large size of the TREC collection raises the
issue of scalability and effectiveness of retrieval
algorithms. The TREC collection is very different
from the standard IR test collections,
such as ADI, Cranfield, CACM, NPL. TREC is
1-4 Gigabytes of text whereas the other collections
are small, i.e., only a few (1-50)
Megabytes. The behavior and effectiveness of algorithms
in information retrieval have been studied
in small collections, and TREC provides the
challenge of scalability.

3.  Nature of documents:

The documents in the WSJ database are mostly
long, full-text documents as opposed to short
bibliographic records; they have less structure than
bibliographic records, and their language
and presentation are less structured (journalistic
style compared to scientific style).

4.  Length  of  documents: 

The records are long and often contain several
short, usually unrelated, stories.
When such documents contain relevant information
for a topic, i.e., when one of the stories is
relevant but all the others are not, the other stories increase
noise and interfere with the selection of terms for
query expansion. This is because all the terms
of that document will be included in the pool of
terms for query expansion, and there may be
a number of terms from other stories in that document
that will be ranked higher than the terms
from the relevant story.

This  reinforces  the  need  to  be  able  to  retrieve  at  a 
paragraph  level  rather  than  at  a  document  level. 


6    Future  Research 

•  evaluate  in  detail  the  level  of  the  effect  of  the  GSL- 
synonym  list  in  retrieval  performance 

•  evaluate  the  different  effect  of  a  local  versus  a  global 
thesaurus  for  query  expansion 

•  evaluate  the  effect  of  variable  bias  in  query  expansion 
term  weighting 

•  investigate the retrieval overlap between different approaches,
and

•  explore data fusion techniques for output integration

Table 1:  Runs with and without query expansion for topics 51-100.
(Runs are presented in descending 'Recall' values.)

Run_name                              Avg Prec  Prec[5]   Prec[10]  Prec[15]  Prec[30]  Prec[100] R-Prec    Recall
bm15.phb.qey:r_lohi-10-5.uclagsln     -         0.5640    0.5160    0.4893    0.4400    0.3288    0.3465    0.7322
bm15.phb.qey:r_lohi-10-10.uclagsln    0.2960    0.5640    0.5160    0.4907    0.4400    0.3288    0.3472    0.7320
bm15.phb.qey:r_lohi-10-15.uclagsln    0.2961    0.5680    0.6160    0.4907    -         -         0.3472    -
bm15.phb.qey:r_lohi-10-10.uclagsly    -         0.5640    0.5160    0.4907    -         -         0.3472    -
bm15.phb.qey:r_hilo-10-10.uclagsln    0.2860    0.5480    -         0.4733    0.4167    0.3148    0.3327    -
bm15.phb.qey:r_hilo-10-10.uclagsly    0.2860    0.5480    -         0.4733    0.4167    0.3148    -         -
bm15.phn.qen.uclagsly                 0.2849    0.5560    0.5040    0.4720    0.4200    0.3164    -         0.7264
bm15.phn.qen.uclagsln                 0.2849    0.5560    0.5040    0.4720    0.4200    0.3164    -         0.7264
bm15.phb.qen.uclagsly                 0.2846    0.5560    -         -         -         -         0.3325    -
bm15.phb.qey:emi-10-10.uclagsly       0.2746    0.5240    0.4960    0.4573    0.4027    -         0.3106    0.7113
bm15.phb.qey:emi-10-10.uclagsln       0.2746    0.5240    0.4960    0.4573    0.4027    -         0.3106    0.7113
bm15.phb.qey:r_lohi-20-5.uclagsln     0.2968    0.5760    0.5240    0.5027    0.4527    -         0.3508    0.7060
bm15.phb.qey:r_lohi-20-15.uclagsln    0.2969    0.5800    0.5280    0.5067    0.4520    0.3362    0.3512    0.7060
bm15.phb.qey:r_lohi-20-10.uclagsln    0.2967    0.5800    0.5260    0.5067    0.4520    0.3364    0.3512    0.7060
bm15.phy.qen.uclagsly                 0.2406    0.4600    0.4680    0.4360    0.3800    0.2878    0.2967    0.6777
bm15.phb.qey:wpq-10-10.uclagsln       0.2590    0.5360    0.4980    0.4440    0.3900    0.2884    0.3017    0.6768
bm15.phb.qey:wpq-10-10.uclagsly       0.2590    0.5360    0.4980    0.4440    0.3900    0.2884    0.3017    0.6768
bm15.phb.qey:wpq-10-15.uclagsln       0.2570    0.5160    0.4780    0.4520    0.3953    -         0.2963    0.6750
bm15.phb.qey:r_lohi-30-15.uclagsln    -         0.5760    -         -         -         -         -         -
bm15.phb.qey:r_lohi-30-10.uclagsln    -         0.5800    -         -         -         0.3370    -         -
bm15.phb.qey:r_lohi-30-5.uclagsln     0.2894    0.5840    -         0.5133    0.4520    0.3370    0.3450    0.6678
bm15.phb.qey:emi-20-10.uclagsly       0.2458    -         0.4880    0.4547    0.3873    0.2696    0.2875    0.6599
bm15.phb.qey:wpq-20-10.uclagsln       0.2443    0.5120    0.4660    0.4413    0.3847    0.2706    0.2838    0.6437
bm15.phb.qey:wpq-10-5.uclagsln        0.2438    0.5440    0.4900    0.4533    0.3800    0.2740    0.2849    0.6423
bm15.phb.qey:wpq-10-5.uclagsly        0.2438    0.5440    0.4900    0.4533    -         0.2740    0.2849    0.6423
bm15.phb.qey:por-10-10.uclagsly       -         -         0.4980    0.4560    0.4047    -         -         0.6362
bm15.phb.qey:por-10-10.uclagsln       0.2509    0.5320    -         -         -         -         -         -
bm15.phb.qey:emi-30-10.uclagsly       0.2377    -         0.4860    0.4480    0.3780    -         0.2809    0.6331
bm15.phb.qey:por-20-10.uclagsly       0.2558    0.5320    0.5080    0.4760    0.4113    0.2856    -         -
bm15.phb.qey:wpq-20-15.uclagsln       -         -         -         -         -         -         -         -
bm15.phb.qey:por-30-10.uclagsly       0.2494    0.5000    0.5080    0.4733    0.4167    0.2926    0.3088    0.6270
bm15.phb.qey:wpq-30-10.uclagsln       0.2318    0.5280    0.4920    0.4387    0.3753    0.2620    0.2744    0.6121
bm15.phb.qey:wpq-30-15.uclagsln       0.2271    0.5040    0.4860    0.4467    0.3700    0.2646    0.2668    0.6079
bm15.phb.qey:wpq-20-5.uclagsln        0.2266    0.5360    0.4820    0.4467    0.3707    0.2570    0.2699    0.6016
bm15.phb.qey:wpq-30-5.uclagsln        0.2173    0.5280    0.4680    0.4347    0.3660    0.2480    0.2651    0.5790

Table 2:  Ad hoc results: Runs with and without query expansion for topics 101-150.
(Runs are presented in descending 'Recall' values.)

Run_name                                        Avg Prec  Prec[5]   Prec[10]  Prec[15]  Prec[30]  Prec[100] R-Prec    Recall
uclaa1: auto.bm15.phb.qen.uclagsly              0.3345    0.5840    0.5380    0.4973    0.4333    0.3098    0.3629    0.8155
uclaa2: auto.bm15.phb.qey:wpq-10-10.uclagsly    0.2957    0.5440    0.5180    0.4920    0.4207    0.2760    0.3289    0.7786
uclaf1: fdbk.bm15.phb.qey:wpq-10-10.uclagsly    0.3090    0.5880    0.5220    0.4893    0.4360    0.2884    0.3459    0.7745

Table 3:  Performance Averages over all Topics of the Ad hoc Runs with Query Expansion.
(Runs Named after the Algorithm used in the Expansion.)

Run_Name       Avg Prec  Prec[5]   Prec[10]  Prec[15]  Prec[30]  Prec[100] R-Prec    Recall
r_lohi         0.3414    0.5880    0.5240    0.4947    0.4427    0.3152    0.3688    0.8290
r_hilo         0.3388    0.5880    0.5240    0.4960    0.4347    0.3160    0.3692    0.8333
emim           0.3176    0.5880    0.5240    0.4920    0.4433    0.2938    0.3554    0.7989
uclaf1: wpq    0.3087    0.5880    0.5240    0.4893    0.4360    0.2882    0.3460    0.7753
porter         0.2990    0.5880    0.5240    0.4893    0.4280    0.2798    0.3323    0.7457

Table 4:  Sign Test Differences, Precision at 15 Documents

          wpq   porter   emim   r_lohi   r_hilo
wpq         0        3      1        3        2
porter      4        0      4        3        1
emim        2        4      0        3        4
r_lohi      5        6      4        0        3
r_hilo      5        5      5        3        0

Table 5:  Sign Test Probabilities, Precision at 15 Documents

          wpq      porter   emim     r_lohi   r_hilo
wpq       1.0000
porter    1.0000   1.0000
emim      1.0000   1.0000   1.0000
r_lohi    0.7266   0.5078   1.0000   1.0000
r_hilo    0.4531   0.2188   1.0000   1.0000   1.0000

Table 6:  Sign Test Differences, Precision at 30 Documents

          wpq   porter   emim   r_lohi   r_hilo
wpq         0       18     12       13       15
porter     12        0     15       11       12
emim       14       18      0       16       19
r_lohi     18       18     18        0       15
r_hilo     13       18     16        7        0

Table 7:  Sign Test Probabilities, Precision at 30 Documents

          wpq      porter   emim     r_lohi   r_hilo
wpq       1.0000
porter    0.3613   1.0000
emim      0.8445   0.7277   1.0000
r_lohi    0.4725   0.2652   0.8638   1.0000
r_hilo    0.8501   0.3613   0.7353   0.1338   1.0000

Table 8:  Sign Test Differences, Precision at 100 Documents

          wpq   porter   emim   r_lohi   r_hilo
wpq         0       22      8        8        8
porter     15        0     15        8        8
emim       19       24      0       12        9
r_lohi     32       31     27        0       15
r_hilo     32       33     27       14        0

Table 9:  Sign Test Probabilities, Precision at 100 Documents

          wpq      porter   emim     r_lohi   r_hilo
wpq       1.0000
porter    0.3239   1.0000
emim      0.0543   0.2002   1.0000
r_lohi    0.0003   0.0004   0.0250   1.0000
r_hilo    0.0003   0.0002   0.0046   1.0000   1.0000

Table 10:  Sign Test Differences, Average Precision

          wpq   porter   emim   r_lohi   r_hilo
wpq         0       29     15       10       11
porter     20        0     17        9       10
emim       25       32      0       13       16
r_lohi     39       40     36        0       31
r_hilo     38       39     33       17        0

Table 11:  Sign Test Probabilities, Average Precision

          wpq      porter   emim     r_lohi   r_hilo
wpq       1.0000
porter    0.2531   1.0000
emim      0.1547   0.0455   1.0000
r_lohi    0.0001   0.0000   0.0017   1.0000
r_hilo    0.0002   0.0001   0.0223   0.0606   1.0000

Table 12:  Sign Test Differences, Recall

          wpq   porter   emim   r_lohi   r_hilo
wpq         0       24      8       10        7
porter      8        0     10        4        3
emim       15       25      0        9        6
r_lohi     27       28     25        0       14
r_hilo     29       32     26       14        0

Table 13:  Sign Test Probabilities, Recall

          wpq      porter   emim     r_lohi   r_hilo
wpq       1.0000
porter    0.0080   1.0000
emim      0.2100   0.0180   1.0000
r_lohi    0.0085   0.0000   0.0101   1.0000
r_hilo    0.0005   0.0000   0.0008   1.0000   1.0000

Table 14:  Sign Test Differences, Recall/Precision

          wpq   porter   emim   r_lohi   r_hilo
wpq         0       19      5       10        8
porter     11        0      7        6        7
emim       17       23      0       11       11
r_lohi     23       26     19        0       13
r_hilo     27       26     20       12        0

Table 15:  Sign Test Probabilities, Recall/Precision

          wpq      porter   emim     r_lohi   r_hilo
wpq       1.0000
porter    0.2012   1.0000
emim      0.0169   0.0062   1.0000
r_lohi    0.0367   0.0008   0.2012   1.0000
r_hilo    0.0023   0.0017   0.1508   1.0000   1.0000

Table 16:  t-test, Precision at 15 Documents

Runs             Mean Difference   SD Difference         t   Probability
r_hilo/r_lohi             0.0013          0.0285    0.3305        0.7424
r_hilo/emim               0.0040          0.0367    0.7712        0.4443
r_lohi/emim               0.0027          0.0300    0.6282        0.5328
r_hilo/porter             0.0067          0.0278    1.6973        0.0960
r_lohi/porter             0.0053          0.0325    1.1579        0.2525
emim/porter               0.0027          0.0380    0.4957        0.6224
r_hilo/wpq                0.0067          0.0337    1.3999        0.1678
r_lohi/wpq                0.0053          0.0352    1.0708        0.2897
emim/wpq                  0.0027          0.0232    0.8137        0.4197
porter/wpq                0.0000          0.0404    0.0007        0.9994

Table 17: t-test, Precision at 30 Documents

Runs             Mean Difference  SD Difference  t         Probability
r_hilo/r_lohi    -0.0080          0.0298         -1.8984   [illegible]
r_hilo/emim      -0.0087          0.0583         -1.0521   [illegible]
r_lohi/emim      -0.0007          0.0585         -0.0810   0.9358
r_hilo/porter     0.0067          0.0539          0.8748   0.3860
r_lohi/porter     0.0147          0.0518          2.0019   0.0508
emim/porter       0.0153          0.0694          1.5624   0.1246
r_hilo/wpq       -0.0013          0.0534         -0.1770   0.8602
r_lohi/wpq        0.0067          0.0530          0.8882   0.3788
emim/wpq          0.0073          0.0458          1.1312   0.2635
porter/wpq       -0.0080          0.0450         -1.2590   0.2140

Table 20: t-test, Recall

Runs             Mean Difference  SD Difference  t         Probability
r_hilo/r_lohi     0.0005          0.0312          0.1052   0.9166
r_hilo/emim       0.0363          0.0743          3.4503   0.0012
r_lohi/emim       0.0358          0.0819          3.0935   0.0033
r_hilo/porter     0.0731          0.1094          4.7267   0.0000
r_lohi/porter     0.0727          0.1108          4.6356   0.0000
emim/porter       0.0369          0.0965          2.7010   0.0095
r_hilo/wpq        0.0506          0.0835          4.2805   0.0001
r_lohi/wpq        0.0501          0.0908          3.9007   0.0003
emim/wpq          0.0143          0.0529          1.9106   0.0619
porter/wpq       -0.0226          0.0739         -2.1583   0.0358

Table 18: t-test, Precision at 100 Documents

Runs             Mean Difference  SD Difference  t         Probability
r_hilo/r_lohi     0.0008          0.0267          0.2118   0.8332
r_hilo/emim       0.0222          0.0435          3.6060   0.0007
r_lohi/emim       0.0214          0.0469          3.2291   0.0022
r_hilo/porter     0.0362          0.0656          3.8991   0.0003
r_lohi/porter     0.0354          0.0648          3.8658   0.0003
emim/porter       0.0140          0.0535          1.8494   0.0704
r_hilo/wpq        0.0278          0.0557          3.5311   0.0009
r_lohi/wpq        0.0270          0.0600          3.1815   0.0025
emim/wpq          0.0056          0.0301          1.3150   0.1946
porter/wpq       -0.0084          0.0410         -1.4496   0.1536

Table 19: t-test, Average Precision

Runs             Mean Difference  SD Difference  t         Probability
r_hilo/r_lohi    -0.0026          0.0153         -1.1935   0.2384
r_hilo/emim       0.0212          0.0421          3.5637   0.0008
r_lohi/emim       0.0238          0.0470          3.5786   0.0008
r_hilo/porter     0.0398          0.0615          4.5727   0.0000
r_lohi/porter     0.0424          0.0658          4.5547   0.0000
emim/porter       0.0186          0.0592          2.2172   0.0313
r_hilo/wpq        0.0301          0.0537          3.9615   0.0002
r_lohi/wpq        0.0327          0.0603          3.8291   0.0004
emim/wpq          0.0089          0.0335          1.8726   0.0671
porter/wpq       -0.0097          0.0358         -1.9107   0.0619

Table 21: t-test, Recall/Precision

Runs             Mean Difference  SD Difference  t         Probability
r_hilo/r_lohi     0.0003          0.0246          0.0978   0.9225
r_hilo/emim       0.0138          0.0450          2.1635   0.0354
r_lohi/emim       0.0134          0.0517          1.8396   0.0719
r_hilo/porter     0.0369          0.0636          4.1048   0.0002
r_lohi/porter     0.0366          0.0669          3.8626   0.0003
emim/porter       0.0231          0.0576          2.8398   0.0066
r_hilo/wpq        0.0231          0.0532          3.0729   0.0035
r_lohi/wpq        0.0228          0.0635          2.5401   0.0143
emim/wpq          0.0094          0.0313          2.1135   0.0397
porter/wpq       -0.0138          0.0441         -2.2100   0.0318
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A  Runs 

The  runs  presented  here  are  grouped  by  algorithm  used  for  the 
expansion.  The  list  starts  with  runs  that  did  not  include  query 
expansion. 
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Abstract 

Semantic  information  obtained  from  the  public  domain 
1911  version  of  Roget's  Thesaurus  is  combined  with  key- 
words to  measure  similarity  between  natural  language  topics 
and  documents.  Two  approaches  are  explored.  In  one 
approach,  a  combination  of  keyword  relevance  and  semantic 
relevance  is  achieved  by  using  the  vector  processing  model 
for  calculating  similarity,  but  extending  the  use  of  a  keyword 
weight  by  using  individual  weights  for  each  of  its  meanings. 
This  approach  is  based  on  the  database  concept  of  semantic 
modeling  and  the  linguistic  concept  of  thematic  roles.  It  is 
applicable to both routing and archival retrieval. The second
approach  is  especially  suited  for  routing.  It  is  based  on  an  AI 
connectionist  model.  In  this  approach,  a  probabilistic 
inference  network  is  modified  using  semantic  information  to 
achieve  a  competitive  activation  mechanism  that  can  be  used 
for  calculating  similarity. 

Keywords:  vector  processing  model,  semantic  data  model, 
semantic  lexicon,  inference  network,  connectionist  model. 

1.  Introduction 

The  experiments  reported  here  use  a  relatively  efficient 
method  to  detect  the  semantic  representation  of  text.  Our 
original  method  is  based  on  semantic  modeling  and  is 
described  in  [4,17,19]. 

Semantic  modeling  was  an  object  of  considerable  database 
research in the late 1970's and early 1980's. A brief overview
can  be  found  in  [3].  Essentially,  the  semantic  modeling 
approach  identified  concepts  useful  in  talking  informally 
about  the  real  world.  These  concepts  included  the  two  notions 
of  entities  (objects  in  the  real  world)  and  relationships  among 
entities  (actions  in  the  real  world).  Both  entities  and  rela- 
tionships have  properties. 

The  properties  of  entities  are  often  called  attributes.  There 
are  basic  or  surface  level  attributes  for  entities  in  the  real 


world.  Examples  of  surface  level  entity  attributes  are  General 
Dimensions,  Color,  and  Position.  These  properties  are 
prevalent  in  natural  language.  For  example,  consider  the 
phrase  "large,  black  book  on  the  table"  which  indicates  the 
General  Dimensions,  Color,  and  Position  of  the  book. 

In  linguistic  research,  the  basic  properties  of  relationships 
are  discussed  and  called  thematic  roles.  Thematic  roles  are 
also  referred  to  in  the  literature  as  participant  roles,  semantic 
roles  and  case  roles.  Examples  of  thematic  roles  are  Bene- 
ficiary and  Time.  Thematic  roles  are  prevalent  in  natural 
language;  they  reveal  how  sentence  phrases  and  clauses  are 
semantically related to the verbs in a sentence. For example,
consider  the  phrase  "purchase  for  Mary  on  Wednesday" 
which  indicates  who  benefited  from  a  purchase  (Beneficiary) 
and  when  a  purchase  occurred  (Time). 

A  main  goal  of  our  research  has  been  to  detect  thematic 
information  along  with  attribute  information  contained  in 
natural  language  queries  and  documents.  In  order  to  use  this 
additional  information,  the  concept  of  text  relevance  needs 
to  be  modified. 

In  [17,19]  the  major  modifications  included  the  addition 
of  a  lexicon  with  thematic  and  attribute  information,  and  a 
modified  computation  of  a  vector  processing  similarity 
coefficient. That research concerned a Question/Answer
environment  where  queries  were  the  length  of  a  sentence  and 
documents  were  either  a  sentence  or  at  most  a  paragraph.  At 
that  time,  our  lexicon  was  based  on  36  semantic  categories, 
and  in  that  environment,  our  semantic  approach  produced  a 
significant  improvement  in  retrieval  performance. 

However, for TREC-1 [4], document and topic length
presented a problem and caused our semantic approach based
on 36 semantic categories to be of little value. As
reported in [4], breaking the TREC documents into
paragraphs nevertheless produced a significant improvement.


This  work  has  been  supported  in  part  by  NASA  KSC  Cooperative  Agreement  NCC  10-003  Project  2,  Florida  High  Technol- 
ogy and  Industry  Council  Grants  4940-11-28-721  and  4940-11-28-728. 
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In  Section  2,  we  describe  our  original  semantic  lexicon 
and  an  extension  which  uses  a  larger  number  of  semantic 
categories.  Section  3  presents  an  application  of  an  AI 
connectionist  model  to  the  task  of  routing.  Section  4  presents 
an approach different from that reported in TREC-1 [4], using our
extended semantic lexicon within the vector processing
model. Section 5 summarizes our research effort.

2.     The  Semantic  Lexicon 

Our  semantic  approach  uses  a  thesaurus  as  a  source  of 
semantic  categories  (thematic  and  attribute  information).  For 
example, Roget's Thesaurus contains a hierarchy of word
classes  to  relate  word  senses  [14].  In  TREC-1  [4]  and  in 
earlier  research  [17,19],  we  selected  several  classes  from  this 
hierarchy  to  be  used  for  semantic  categories.  We  defined 
thirty-six  semantic  categories  as  shown  in  Figure  1. 

In  order  to  explain  the  assignment  of  semantic  categories 
to  a  given  term  using  Roget's  Thesaurus,  consider  the  brief 
index  quotation  for  the  term  "vapor": 

vapor
    n.  fog 404.2
        fume 401
        illusion 519.1
        spirit 4.3
        steam 328.10
        thing imagined 535.3
    v.  be bombastic 601.6
        bluster 911.3
        boast 910.6
        exhale 310.23
        talk nonsense 547.5


The  eleven  different  meanings  of  the  term  "vapor"  are  given 
in  terms  of  a  numerical  category.  We  developed  a  mapping 
of  the  numerical  categories  in  Roget's  Thesaurus  to  the 
thematic  role  and  attribute  categories  given  in  Figure  1.  In 
this  example,  "fog"  and  "fume"  correspond  to  the  attribute 
State;  "steam"  maps  to  the  attribute  Temperature;  and  "ex- 
hale" is  a  trigger  for  the  attribute  Motion  with  Reference  to 
Direction.  The  remaining  seven  meanings  associated  with 
"vapor"  do  not  trigger  any  thematic  roles  or  attributes.  Since 
there  are  eleven  meanings  associated  with  "vapor,"  we 
indicated  in  the  lexicon  a  probability  of  1/11  each  time  a 
category is triggered. Hence, a probability of 2/11 is assigned
to  State,  1/11  to  Temperature,  and  1/11  to  Motion  with 
Reference  to  Direction.  This  technique  of  calculating  prob- 
abilities is  being  used  as  a  simple  alternative  to  a  corpus 
analysis. 
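
The probability assignment described above can be sketched in a few lines of Python. This is only an illustration, not the code used in the experiments: the function name category_probabilities and the representation of a lexicon entry as a list with one slot per word sense (None for a sense that triggers no category) are assumptions made for the example.

```python
from collections import Counter
from fractions import Fraction


def category_probabilities(senses):
    """Return {category: probability}, treating every sense as equally likely.

    `senses` holds one entry per word sense; None marks a sense that
    triggers no semantic category.
    """
    total = len(senses)
    counts = Counter(c for c in senses if c is not None)
    return {cat: Fraction(n, total) for cat, n in counts.items()}


# The eleven senses of "vapor" from the index entry quoted above.
vapor = [
    "State",                               # fog
    "State",                               # fume
    None,                                  # illusion
    None,                                  # spirit
    "Temperature",                         # steam
    None,                                  # thing imagined
    None,                                  # be bombastic
    None,                                  # bluster
    None,                                  # boast
    "Motion with Reference to Direction",  # exhale
    None,                                  # talk nonsense
]

probs = category_probabilities(vapor)
for category, p in probs.items():
    print(category, p)
```

Exact fractions are used so that the result can be compared directly with the 2/11 and 1/11 values given in the text.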

It  should  be  pointed  out  that  we  are  still  experimenting 
with  other  ways  of  calculating  probabilities.  For  example,  as 
in [8], a probabilistic part-of-speech tagger could be used to
further  restrict  the  different  meanings  of  a  term,  and  existing 
lexical  sources  could  be  used  to  obtain  an  ordering  based  on 
frequency  of  use  for  the  different  meanings  of  a  term. 

As  reported  in  [4],  the  use  of  36  semantic  categories  caused 
problems  when  dealing  with  TREC  documents.  When  the 
size  of  a  document  is  large,  a  greater  number  of  the  36 
semantic  categories  are  triggered  in  the  document.  Also, 
when  using  the  semantic  approach  described  in  [19]  the 
probability  present  for  each  category  in  a  document  is  often 
very  close  to  one.  Consequently,  almost  every  one  of  the 
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36  semantic  categories  becomes  present  in  every  document. 
This  causes  semantic  category  weights  to  become  very  low 
and  useless  within  that  approach. 

As  reported  in  [4],  one  way  to  solve  this  problem  is  to 
break  TREC  documents  into  paragraphs.  But,  another  way 
to  solve  the  problem  of  long  documents  causing  semantic 
weights  to  be  of  little  value  is  to  have  more  semantic 
categories.  A  large  number  of  "semantic"  categories  can  be 
obtained (for example) by using all the categories and/or
subcategories  found  in  Roget's  Thesaurus,  instead  of  the  36 
semantic  categories  we  have  used.  This  may  be  a  deviation 
from  database  semantic  modeling.  In  any  case,  it  needs  to  be 
examined. 

Consequently,  for  the  experiments  reported  here,  a 
semantic  lexicon  was  created  based  on  all  the  word  senses 
found  in  the  public  domain  1911  version  of  Roget's  The- 
saurus. To  provide  an  example,  consider  Topic  052  as  shown 
in Figure 2. Figure 3 indicates the keywords and frequency
information  within  Topic  052,  along  with  the  semantic 
categories  obtained  from  our  extended  lexicon  for  those 
keywords.  Note  that  stemming  was  not  used  for  the  pro- 
cessing of  Topic  052;  so,  some  keywords  in  Topic  052  were 
not  located  in  our  lexicon  (e.g.  sanctions). 

The  categories  recorded  in  our  extended  semantic  lexicon 
use the category numbers found in the 1911 version of Roget's
Thesaurus. These numbers are then followed by a part-of-speech
code also found in the 1911 version of Roget's
Thesaurus. The number after the part-of-speech code
represents a sub-category, but this number does not appear
in the 1911 version of Roget's Thesaurus. That number was
created  based  on  groupings  of  words  within  the  thesaurus. 

3.  Connectionist  Model  Routing  Experiments 

Recent  work  suggests  that  significant  improvements  in 
retrieval  performance  will  require  a  technique  that,  in  some 
sense,  "understands"  the  content  of  documents  and  queries 
and  can  be  used  to  infer  probable  relationships  between 
documents  and  queries  [2].  In  this  view,  information  retrieval 
is  an  inference  or  evidential  reasoning  process  in  which  we 
estimate  the  probability  that  a  user's  information  need  is  met 
given  a  document  as  "evidence".  The  techniques  required  to 
support  this  kind  of  inference  are  similar  to  those  used  in 
expert  systems  that  must  reason  with  uncertain  information. 
Several  probabilistically-oriented  inference  network  models 
have  been  developed  using  experimental  document  collec- 
tions [5]  during  the  past  few  years  for  information  retrieval 
[15].  These  models  are  generally  characterized  by  an 
architecture  with  two  layers  corresponding  to  documents  and 
index  terms.  The  documents  and  index  terms  are  connected 
by  direct  links.  Initially,  the  prior  probabilities  of  all  root 
nodes  (nodes  with  no  predecessors)  and  the  conditional 
probabilities  of  all  non-root  nodes  (given  all  possible 
combinations  of  their  direct  predecessors)  must  be  specified. 
A  retrieval  consists  of  one  or  more  documents  with  the  highest 
posterior probability for the given set of index terms (evidences)
which represent a user's information need.

Over  the  last  few  years,  the  technique  of  automated 
inference  using  probabilistic  inference  networks  has  become 
popular  within  the  AI  probability  and  uncertainty  community, 
particularly in the context of expert systems [6,7]. The most


<top>

<head> Tipster Topic Description
<num> Number: 052
<dom> Domain: International Economics
<title> Topic: South African Sanctions
<desc> Description:

Document discusses sanctions against South Africa
<narr> Narrative:

A relevant document will discuss any aspect of South African sanctions, such
as: sanctions declared/proposed by a country against the South African
government in response to its apartheid policy, or in response to pressure by
an individual, organization or another country; international sanctions against
Pretoria imposed by the United Nations; the effects of sanctions against S.
Africa; opposition to sanctions; or, compliance with sanctions by a company.
The document will identify the sanctions instituted or being considered, e.g.,
corporate disinvestment, trade ban, academic boycott, arms embargo.

<con> Concept(s):

1. sanctions, international sanctions, economic sanctions

2. corporate exodus, corporate disinvestment, stock divestiture, ban on new
investment, trade ban, import ban on South African diamonds, U.N. arms
embargo, curtailment of defense contracts, cutoff of nonmilitary goods,
academic boycott, reduction of cultural ties

3. apartheid, white domination, racism

4. antiapartheid, black majority rule

5. Pretoria
<fac> Factor(s):

<nat> Nationality: South Africa
</fac>

<def> Definition(s):
</top>

Figure  2.  Topic  052. 

important  constraint  on  the  use  of  a  probabilistic  network  is 
the  fact  that  in  general,  the  computation  of  the  exact  posterior 
probabilities  is  NP-hard  [1].  Thus  it  is  unlikely  that  we  could 
develop an efficient general-purpose algorithm which would
work  well  for  all  kinds  of  inference  networks.  There  are 
several  alternatives,  such  as  the  use  of  approximation  algo- 
rithms or  heuristic  algorithms,  and  creating  special  case 
algorithms  [9,10]. 

The  experiments  here  concern  an  attempt  at  a  heuristic 
probabilistic  inference  network  approach  based  on  an  AI 
connectionist model. The connectionist model uses a competitive
activation rule to find the most probable retrieval.
The  term  competitive  activation  rule  refers  to  a  spreading 
activation  method  in  which  nodes  actively  compete  for 
available  activation  in  a  network.  An  initial  formulation  of 
a  competitive  activation  mechanism  was  previously  studied 
on  three  two-layer,  abstract  networks  for  diagnostic  problem 
solving  [11,13].  The  connectionist  model  proposed  here 
consists of a two-layer network architecture. Document
nodes  and  index  term  nodes  corresponding  to  each  layer  are 
connected  by  links  whose  weights  represent  association 
strengths  between  nodes.  These  links  are  also  viewed  as 
channels  for  sending  information  between  nodes.  Figure  4 
is  a  simple  network  consisting  of  two  document  nodes  and 
three  index  term  nodes.  At  each  moment  of  time,  each  node 
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Topic  52 

keyword        freq  semantic categories

curtailment    001   201n.1 38n.2
individual     001   372a.0 372n.2 549a.0 79a.0 87a.0 87n.0
considered     001   611d.0
compliance     001   602n.3 743n.0 762n.0 772n.0
reduction      001   103n.1 144n.0 195n.0 201n.1 308n.1 813n.0 85n.3
economics      002   692n.3
corporate      003   43a.0
response       002   462n.0 586v.3 714n.0 821n.0 888n.2 990n.3
relevant       001   23a.1 476a.2 9v.2
majority       001   102n.1 131n.0 33n.0
academic       002   514a.0 537a.0 542a.0
effects        001   780n.10 780n.5 798n.0
defense        001   717n.0 937n.2
country        002   181n.0 189n.1 189n.8 189n.9 266v.1 344n.0 371a.1
company        001   599n.14 712n.2 726n.8 72n.2 88n.1 892n.4
policy         001   626n.2 692n.2
aspect         001   183n.0 448n.3 7n.0
white          001   429a.1 430a.0 441n.5 996n.5
stock          001   11n.3 153n.4 166n.2 225n.13 25n.3 265a.0 31n.1 501n.2
                     613a.0 635n.0 636n.0 637v.0 637v.2 780n.10 798n.0
                     800n.0 811v.0
goods          001   780n.10 798n.0
black          001   349n.2 421a.0 431a.0 431n.3 432n.1 752n.0 945a.2
rule           001   136n.1 200n.1 466n.2 613n.5 693v.1 697n.1 737n.1
                     737v.2 737v.3 749v.2 80n.0 82n.3 963n.1
arms           002   459d.9 719v.2 722n.0 727n.0 877n.3 894v.3
international  003   12a.0 892a.4
organization   001   161n.0 329n.0 357n.0 357n.1 60n.0
opposition     001   179n.0 237n.0 708n.0 710n.0 719n.0 720n.0
investment     001   225n.0 716n.2 784n.0 809n.3
government     001   692n.3 693n.0 699n.3 737n.1 737n.5
domination     001   175n.0 737n.1
pressure       001   157n.2 175n.0 319n.0 642n.3 735n.1
identify       001   13v.0 464v.1 480av.5
document       003   467n.3 551n.2
embargo        001   265n.2 761n.0
discuss        001   298v.0 451v.0 460v.3 476v.0
boycott        002   297v.2
another        001   104a.1 15a.1 709v.0 714v.0
against        004   14a.0 179v.0 237d.0 276v.0 673d.0 704d.0 708d.0
                     708d.1 708v.0 708v.2 713v.2 716v.1 716v.5 717a.0
                     719v.0 764v.0 898a.0 932v.8
united         001   46a.1 714a.0
import         001   228v.1 296v.0 300v.0 516n.0 516v.0 642n.1 642v.0
exodus         001   293n.0 295v.1
trade          002   625n.4 734v.1 794n.1 794v.1
south          006   278n.1
being          001   1n.0 3n.0 831n.0 976n.2
will           002   360v.1 600n.0 600v.0 602d.0 604a.0 604n.0 737v.3
                     771n.11 784n.4
new            001   123a.0 146v.0 18a.0 614a.0 66n.0
ban            004   761n.0 908n.0 98n.3
any            001   25a.0 51n.0 609an.1


Figure  3:  Word  Frequency  and  Semantic  Categories  for  Topic  052. 
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Figure  4:  A  Simple  Network  Consisting  of  Two  Document 
Nodes  and  Three  Index  Term  Nodes. 

receives  information  about  the  activation  levels  of  its 
immediate  neighboring  nodes  (nodes  connected  to  it  via 
direct  links),  and  then  uses  this  information  to  calculate  its 
own activation level. Through this process of spreading
activation, the network settles down to equilibrium,
representing a retrieval for a user's information need.

The  computation  of  the  information  retrieval  inference 
process is based on a formalization of the causal and probabilistic
associative knowledge underlying diagnostic problem-solving [18].
We do not discuss the formulation, architecture, and activation
mechanism of the connectionist model here. This information can
be found in [11,13,16,18]. For
TREC-2,  we  managed  to  complete  only  one  official  routing 
experiment  for  this  approach,  and  it  did  not  involve  semantics. 
The  experiment  was  intended  to  be  a  baseline  experiment  for 
our  semantic  experiments. 

For  TREC-2,  a  specific  network  was  constructed  for  50 
topics. A list of index terms was assembled based on
keywords  in  the  concept  section  of  each  topic.  In  this  network, 
each  output  node  represented  a  topic,  and  each  input  node 
represented  a  keyword.  The  prior  probability  assigned  to  each 
topic node was equal to 1/(total number of topics). The
connection  strengths  were  assigned  equal  weights  (0.9). 

The  network  contained  50  topic  nodes  and  848  index  term 
nodes.  These  nodes  were  connected  via  1449  links.  An 
example of this network is shown in Figure 5, where p_i is the
prior probability of topic top_i. The keywords "army",
"engineer", and "plant" were obtained by processing the
concept section of topic top_i. Currently, the network is
enhanced  by  using  an  estimated  weighting  scheme. 
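
The construction of this baseline network can be sketched as follows. The sketch covers only the set-up described above (uniform priors and a fixed link weight of 0.9); it does not implement the competitive activation mechanism, which is described in the cited references. The function name build_network, the dictionary layout, and the second toy topic are assumptions made for illustration.

```python
def build_network(topic_keywords, link_weight=0.9):
    """Build a two-layer topic/keyword network as plain dictionaries.

    topic_keywords maps a topic id to the keywords taken from its
    concept section. Returns (priors, term_nodes, links).
    """
    n_topics = len(topic_keywords)
    # Every topic node gets the same prior: 1 / (total number of topics).
    priors = {topic: 1.0 / n_topics for topic in topic_keywords}
    # One link per distinct (keyword, topic) pair, all with equal weight.
    links = {}
    for topic, keywords in topic_keywords.items():
        for keyword in set(keywords):
            links[(keyword, topic)] = link_weight
    term_nodes = {keyword for keyword, _ in links}
    return priors, term_nodes, links


# Toy input: "top_1" uses the keywords named in the text; "top_2" is hypothetical.
topics = {
    "top_1": ["army", "engineer", "plant"],
    "top_2": ["plant", "sanctions"],
}
priors, term_nodes, links = build_network(topics)
print(priors)
print(sorted(term_nodes))
print(len(links))
```

With 50 topics the same rule gives a prior of 1/50 = 0.02 for each topic node.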

We  performed  a  Category  B  routing  experiment.  Using 
just  keywords,  the  results  were  not  good.  The  main  problem 
was  due  to  the  fact  that,  in  the  document  ranking,  many 
documents  had  the  same  score  used  to  generate  the  ranking. 
In  order  to  satisfy  the  requirements  for  the  ranking,  we  had 
to  artificially  rank  those  documents  with  the  same  score.  This 
was  done  based  on  order  of  appearance.  The  performance 
was  terrible  except  for  Topic  66.  This  topic  had  only  two 


[Figure 5 diagram: topic node top_i with prior probability p_i = 0.02, linked to the keyword nodes "army", "engineer", and "plant".]


Figure  5:  A  Sample  Network  of  the  Experimental  Model 


known relevant documents for Category B routing experiments
and our inference network retrieved one of them in the
top 20 documents! No further connectionist model
experiments  have  been  completed.  We  were  unable  to  modify 
the  baseline  keyword  experiment  or  perform  semantic 
experiments  for  this  approach. 

4.  Vector Processing Model Experiments

In  this  section,  we  explain  the  manner  in  which  semantics 
is  incorporated  within  a  vector  processing  model  using  the 
semantic  lexicon  explained  in  Section  2.  Please  note  that  an 
entry  in  our  semantic  lexicon  has  the  form  of  a  word  followed 
by  codes  for  each  of  the  semantic  categories  the  word  triggers. 
We  explain  our  approach  using  a  text  relevance  determination 
procedure  intended  to  show  what  is  being  calculated  rather 
than show the actual computations for the approach. The
procedure presented here generates several outputs that are
really not necessary, but are included just to help explain the
approach.  The  relevance  determination  procedure  is 
explained  using  the  four  documents  and  query  shown  in 
Figure  6.  A  few  preliminary  computations  are  reviewed  in 
order  to  explain  the  procedure. 

First, the number of documents each word is in must be
determined.  Figure  7  shows  a  list  of  words  from  the  four 
documents  and  the  query  of  Figure  6  along  with  the  number 
of  documents  each  word  is  in  (df). 

Next, the inverse document frequency (idf) of each word
is determined by the equation $\log_{10}(N/df)$, where $N = 4$, the
total number of documents. Figure 8 provides the idf of each
word. Sometimes, the idf of a word is undefined. This can
happen  when  a  word  does  not  occur  in  the  documents  but 
does  occur  in  a  query.  For  example,  the  words  "depart",  "do" 
and  "when"  do  not  appear  in  the  four  documents.  Thus,  the 
idf  of  these  terms  cannot  be  defined  here.  Later,  we  will  see 
that  an  adjustment  can  be  made  for  these  undefined  values. 
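
The df and idf computations can be sketched as below, using the four documents and the query of Figure 6. This is an illustrative sketch only: the tokenizer (lowercasing and keeping alphabetic runs) and the use of None for an undefined idf are assumptions, not details taken from the experiments.

```python
import math
import re

documents = [
    "Locomotives pull the trains.",
    "People meet people under the canopy and within trains.",
    "Trains carry freight from the station.",
    "Trains leave the station hourly until noon.",
]
query = "When do trains depart the station?"


def tokenize(text):
    """Lowercase the text and keep alphabetic runs only (an assumed tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def document_frequency(word, docs):
    """Number of documents that contain the word at least once (df)."""
    return sum(1 for doc in docs if word in set(tokenize(doc)))


def idf(word, docs):
    """log10(N/df), or None when the word occurs in no document."""
    df = document_frequency(word, docs)
    if df == 0:
        return None  # undefined: the word appears only in the query
    return math.log10(len(docs) / df)


for word in sorted(set(tokenize(query))):
    print(word, document_frequency(word, documents), idf(word, documents))
```

For "station", which occurs in two of the four documents, this gives log10(4/2), or about 0.3; for "depart", "do" and "when" the idf is left undefined.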

Next,  the  category  probability  of  each  query  word  is 
determined.  Figure  9  shows  an  alphabetized  list  of  all  the 
unique words from the query, the frequency of each word in
the query, and the semantic categories each word triggers.

Document #1
    Locomotives pull the trains.

Document #2
    People meet people under the canopy and within trains.

Document #3
    Trains carry freight from the station.

Document #4
    Trains leave the station hourly until noon.

Query
    When do trains depart the station?

Figure 6. Four Documents and a Query.


word           idf of the word, log10(N/df)

and            .6
canopy         .6
carry          .6
depart         undefined
do             undefined
freight        .6
from           .6
hourly         .6
leave          .6
locomotives    .6
meet           .6
noon           .6
people         .6
pull           .6
station        .3
the            0
trains         0
under          .6
until          .6
when           undefined
within         .6

Figure 8. The idf of Each Word.


word           number of documents
               the word is in (df)

and            1
canopy         1
carry          1
do             0
depart         0
freight        1
from           1
hourly         1
leave          1
locomotives    1
meet           1
noon           1
people         1
pull           1
station        2
the            4
trains         4
under          1
until          1
when           0
within         1

Figure  7.  List  of  Words  in  the  Documents  and  Query. 


word       frequency  category  probability

depart     1          AMDR      1/4
                      TAMT      1/8
do         1          AUSE      1/21
                      ATMP      1/21
                      TCSE      1/21
                      TCNV      2/21
                      TRES      1/21
                      TSRC      1/21
station    1          APOS      3/16
                      AORD      1/8
                      TAMT      1/16
                      TCND      1/8
                      TDGR      1/16
                      TSPL      3/16
the        1
trains     1          AORD      7/24
                      AMDR      1/12
                      AMFR      1/12
                      TACM      1/24
                      TCNV      1/12
when       1          TAMT      1/3
                      TTIM      2/3

Figure  9.  Words  in  the  Query. 
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The  semantic  categories  in  our  example  are  those  shown 
in  Figure  1.  For  example,  consider  the  word  "depart"  which 
occurs  one  time  in  the  query  as  shown  in  Figure  9.  The 
semantic  lexicon  entry  for  the  word  "depart"  using  the 
categories  of  Figure  1  is  as  follows: 

depart: NONE NONE NONE NONE NONE AMDR AMDR TAMT

where  NONE  represents  a  word  sense  not  included  in  the  36 
semantic  categories  of  Figure  1.  If  a  uniform  distribution  is 
assumed,  then  AMDR  is  triggered  1/4  of  the  time  and  TAMT 
is  triggered  1/8  of  the  time.  This  is  shown  in  Figure  9  as  the 
probabilities  for  each  semantic  category. 

A  similar  category  probability  determination  is  done  for 
each  document.  Figure  10  is  an  alphabetized  list  of  all  the 
unique  words  in  Document  #4  of  Figure  6.  The  semantic 
categories  each  word  triggers  along  with  probabilities  are  also 
shown. 

The  text  relevance  determination  procedure  is  shown  in 
Figure  11.  The  procedure  uses  three  input  lists: 

a.  List  of  words  and  the  idf  of  each  word,  as  shown  in  Figure 
8. 

b.  List  of  words  in  the  query  and  the  semantic  categories  they 
trigger  along  with  the  probability  of  triggering  those 
categories,  as  shown  in  Figure  9. 

c.  List  of  words  in  a  document  and  the  semantic  categories 
they  trigger  along  with  the  probability  of  triggering  those 
categories,  as  shown  in  Figure  10. 

The  procedure  operates  as  follows: 

Step  1. 

This  step  determines  the  common  meanings  between  the 
query  and  the  document.  Figure  12  corresponds  to  the  output 
of Step 1 for Document #4. In Step 1, a new list is created as
follows: 

For  each  word  in  the  query,  follow  either  subsection  (a)  or 
(b),  whichever  applies: 

a.  For  each  category  the  word  triggers,  find  each  word  in  the 
document  that  triggers  the  category  and  output  three  things: 

1)  The  word  in  the  query  and  its  frequency  of  occurrence. 

2)  The  word  in  the  document  and  its  frequency  of 
occurrence. 

3)  The  category. 

b.  If  the  word  does  not  trigger  a  category,  then  look  for  the 
word  in  the  document  and  if  found,  output  two  things  and 
a " — ": 

1)  The  word  in  the  query  and  its  frequency  of  occurrence. 

2)  The  word  in  the  document  and  its  frequency  of 
occurrence. 
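
Step 1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the procedure's actual implementation: the function name common_meanings, the mapping of each word to a (frequency, categories) pair, and the "--" string standing in for the dash placeholder are all choices made for the example. The category lists are a small subset taken from the lexicon entries shown in the figures.

```python
def common_meanings(query_entries, doc_entries):
    """Step 1: list (query word, document word, category) triples.

    Both arguments map a word to (frequency, [categories it triggers]).
    """
    output = []
    for q_word, (q_freq, q_cats) in query_entries.items():
        if q_cats:
            # Case (a): pair the query word with every document word
            # that triggers the same category.
            for cat in q_cats:
                for d_word, (d_freq, d_cats) in doc_entries.items():
                    if cat in d_cats:
                        output.append(((q_word, q_freq), (d_word, d_freq), cat))
        elif q_word in doc_entries:
            # Case (b): no category triggered, so fall back to a plain
            # keyword match and record a dash instead of a category.
            d_freq = doc_entries[q_word][0]
            output.append(((q_word, q_freq), (q_word, d_freq), "--"))
    return output


# Subset of the query and of Document #4, for illustration only.
query_entries = {
    "depart": (1, ["AMDR", "TAMT"]),
    "the": (1, []),
}
doc_entries = {
    "leave": (1, ["AMDR", "TAMT"]),
    "station": (1, ["APOS", "AORD", "TAMT", "TCND", "TDGR", "TSPL"]),
    "the": (1, []),
    "trains": (1, ["AORD", "AMDR", "AMFR", "TACM", "TCNV"]),
}

first_list = common_meanings(query_entries, doc_entries)
for item in first_list:
    print(item)
```

For the query word "depart" the sketch pairs it with "leave" and "trains" under AMDR and with "leave" and "station" under TAMT, while "the" is matched only as a keyword.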


word       frequency  category  probability

hourly     1          TTIM      1.0
leave      1          AMDR      [illegible]
                      TAMT      1/7
noon       1          ALDM      1/3
                      TTIM      [illegible]
the        1
station    1          APOS      3/16
                      AORD      1/8
                      TAMT      1/16
                      TCND      1/8
                      TDGR      1/16
                      TSPL      3/16
trains     1          AORD      [illegible]
                      AMFR      1/12
                      TACM      1/24
                      TCNV      1/12
until      1          TTIM      1.0

Figure  10.  Words  in  Document  #4. 


Step  1  -  Refer  to  Figure  12. 
Determine  common  meaning 
between  query  and  the  document. 


Step  2  -  Refer  to  Figure  13. 
Adjust  for  words  in  the 
query  that  are  not  in  any 
of  the  documents. 


Step  3  -  Refer  to  Figure  14. 
Calculate  the  weight  of  a 
semantic  component  in  the  query 
and  calculate  the  weight  of  a 
semantic  component  in  the  document. 


Step  4  -  Refer  to  Figure  15. 
Multiply  the  weight  in  the  query 
by  the  weight  in  the  document. 


Step 5 - Refer to Figure 15.
Sum all the individual products
of Step 4 into a single value which
is the semantic similarity coefficient.


Figure  11.  Relevance  Determination  Procedure  to  Explain 
Semantic  Similarity. 
First List

Item    First Entry        Second Entry       Third Entry
Number  Word & Frequency   Word & Frequency   Category
        in Query           in Document #4

 1      (depart,1)         (leave,1)          AMDR
 2      (depart,1)         (trains,1)         AMDR
 3      (depart,1)         (leave,1)          TAMT
 4      (depart,1)         (station,1)        TAMT
 5      (do,1)             (trains,1)         TCNV
 6      (station,1)        (station,1)        APOS
 7      (station,1)        (station,1)        AORD
 8      (station,1)        (trains,1)         AORD
 9      (station,1)        (leave,1)          TAMT
10      (station,1)        (station,1)        TAMT
11      (station,1)        (station,1)        TCND
12      (station,1)        (station,1)        TDGR
13      (station,1)        (station,1)        TSPL
14      (the,1)            (the,1)            --
15      (trains,1)         (trains,1)         AORD
16      (trains,1)         (leave,1)          AMDR
17      (trains,1)         (trains,1)         AMDR
18      (trains,1)         (trains,1)         AMFR
19      (trains,1)         (trains,1)         TACM
20      (trains,1)         (trains,1)         TCNV
21      (when,1)           (leave,1)          TAMT
22      (when,1)           (hourly,1)         TTIM
23      (when,1)           (noon,1)           TTIM
24      (when,1)           (until,1)          TTIM

Figure 12. Common Meaning.


Second List

Item    First Entry        Second Entry       Third Entry
Number  Word & Frequency   Word & Frequency   Category
        in Query           in Document #4

 1      (leave,1)          (leave,1)          AMDR
 2      (trains,1)         (trains,1)         AMDR
 3      (leave,1)          (leave,1)          TAMT
 4      (station,1)        (station,1)        TAMT
 5      (trains,1)         (trains,1)         TCNV
 6      (station,1)        (station,1)        APOS
 7      (station,1)        (station,1)        AORD
 8      (station,1)        (trains,1)         AORD
 9      (station,1)        (leave,1)          TAMT
10      (station,1)        (station,1)        TAMT
11      (station,1)        (station,1)        TCND
12      (station,1)        (station,1)        TDGR
13      (station,1)        (station,1)        TSPL
14      (the,1)            (the,1)            --
15      (trains,1)         (trains,1)         AORD
16      (trains,1)         (leave,1)          AMDR
17      (trains,1)         (trains,1)         AMDR
18      (trains,1)         (trains,1)         AMFR
19      (trains,1)         (trains,1)         TACM
20      (trains,1)         (trains,1)         TCNV
21      (leave,1)          (leave,1)          TAMT
22      (hourly,1)         (hourly,1)         TTIM
23      (noon,1)           (noon,1)           TTIM
24      (until,1)          (until,1)          TTIM

Figure 13. Adjustment for Words with no idf.
Considering Figure 12, the word "depart" occurs in the
query one time and triggers the category AMDR. The word
"leave" occurs in Document #4 once and also triggers the
category AMDR. Thus, item 1 in Figure 12 corresponds to
subsection (a) as described above. An example using
subsection (b) occurs in item 14 of Figure 12.

Step  2. 

This step adjusts for words in the query that are not in any
of the documents. Figure 13 shows the output of Step 2 for
Document #4. In this step, another list is created from the list
created in Step 1. For each item in the Step 1 list whose
First Entry word has an undefined idf, this step replaces the
word in the First Entry column by the word in the Second
Entry column. For example, the word "depart" has an
undefined idf, as shown in Figure 8. Thus, the word "depart"
in item 1 of Figure 12 is replaced by the word "leave" from
the Second Entry column. This is shown in item 1 of
Figure 13. Likewise, the words "do" and "when" also have an
undefined idf and are replaced by the corresponding words
from the Second Entry column.
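
A minimal sketch of this substitution follows. The function name and the tuple layout are hypothetical, and the idf values are simply the factors that appear in Figure 14 (0.6, 0.3, and 0); a word missing from the dictionary stands for an undefined idf.

```python
def adjust_for_undefined_idf(items, idf):
    """Step 2: copy the Second Entry into the First Entry when the
    query word has no idf (it occurs in none of the documents)."""
    adjusted = []
    for first, second, category in items:
        if idf.get(first[0]) is None:
            first = second
        adjusted.append((first, second, category))
    return adjusted


# idf factors as they appear in Figure 14; "depart" is absent (undefined).
idf = {"leave": 0.6, "station": 0.3, "the": 0.0, "trains": 0.0}

step1 = [
    (("depart", 1), ("leave", 1), "AMDR"),    # item 1 of Figure 12
    (("station", 1), ("trains", 1), "AORD"),  # item 8 of Figure 12
]
step2 = adjust_for_undefined_idf(step1, idf)
```

Item 1 becomes (leave, leave, AMDR), as in Figure 13, while item 8 is left unchanged because "station" has a defined idf.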

Step  3. 

This step calculates the weight of a semantic component
in the query and calculates the weight of a semantic
component in the document. Figure 14 shows the output of Step 3
for Document #4. In Step 3, another list is created from the
list created in Step 2 as follows:

For  each  item  in  the  Step  2  list,  follow  either  subsection  (a) 
or  (b),  whichever  applies: 

a. If the Third Entry specifies a category, then

1) Replace the First Entry by computing:

\[
\left(\text{idf of word in First Entry}\right) \cdot
\left(\text{frequency of word in First Entry}\right) \cdot
\left(\text{probability the word in First Entry triggers the category in the Third Entry}\right)
\]

2) Replace the Second Entry by computing:

\[
\left(\text{idf of word in Second Entry}\right) \cdot
\left(\text{frequency of word in Second Entry}\right) \cdot
\left(\text{probability the word in Second Entry triggers the category in the Third Entry}\right)
\]

3) Omit the Third Entry.

b. If the Third Entry does not specify a category, then

1) Replace the First Entry by computing:

\[
\left(\text{idf of word in First Entry}\right) \cdot
\left(\text{frequency of word in First Entry}\right)
\]

2) Replace the Second Entry by computing:

\[
\left(\text{idf of word in Second Entry}\right) \cdot
\left(\text{frequency of word in Second Entry}\right)
\]

3) Omit the Third Entry.


In  Figure  14,  item  1  is  an  example  of  using  subsection  (a), 
and  item  14  is  an  example  of  using  subsection  (b). 
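
The two cases of Step 3 can be sketched as a single helper. The names `component_weight`, `idf`, and `lexicon` are hypothetical, and the numbers are the factors that appear in Figure 14 for the words "leave" and "the".

```python
def component_weight(entry, category, idf, lexicon):
    """Weight of one semantic component: idf * frequency, times the
    trigger probability when a category is specified (subsection a)."""
    word, freq = entry
    weight = idf[word] * freq
    if category is not None:
        weight *= lexicon[word][category]
    return weight


idf = {"leave": 0.6, "the": 0.0}         # factors as they appear in Figure 14
lexicon = {"leave": {"AMDR": 1 / 7}}     # trigger probabilities

w_item1 = component_weight(("leave", 1), "AMDR", idf, lexicon)  # .6 * 1 * 1/7
w_item14 = component_weight(("the", 1), None, idf, lexicon)     # 0 * 1
```

Rounded to four places, `w_item1` is .0857, matching item 1 of Figure 14, and `w_item14` is 0, matching item 14.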


Step  4. 

This step multiplies the weights in the query by the weights
in the document. The top portion of Figure 15 shows the
output of Step 4. In the list created here, the numerical value
created in the First Entry column of Figure 14 is multiplied
by the numerical value created in the Second Entry column
of Figure 14.

Step  5. 

This step sums the values in the Step 4 list to compute the
semantic similarity coefficient for a particular document. The
bottom portion of Figure 15 shows the output of Step 5 for
Document #4.
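
Steps 4 and 5 together amount to a sum of products. The sketch below uses a hypothetical function name and, to stay short, only the weights of items 22 through 24 from Figure 14.

```python
def similarity_coefficient(weight_pairs):
    """Step 4 multiplies each (query weight, document weight) pair;
    Step 5 sums the products."""
    return sum(q * d for q, d in weight_pairs)


# Items 22, 23, and 24 of Figure 14.
pairs = [(0.6000, 0.6000), (0.4000, 0.4000), (0.6000, 0.6000)]
partial = similarity_coefficient(pairs)
```

These three items alone contribute about .88 of the total of 0.91986 reported in Figure 15; the remaining items supply the rest.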

We have finally observed an improved Precision/Recall
performance using the semantic similarity coefficient
explained here. For example, in a Category B filtering
experiment where the words being considered were only those
in the topics and idf values were determined by the number
of topics a word is in, we have observed the keyword and
semantic results shown in Figure 16 and Figure 17,
respectively. The 11-pt average for these two experiments reveals
a 23% increase due to the use of semantic categories.
According to Sparck Jones' criteria, this change would be
classified as "significant" (greater than 10.0%) [12]. We
believe further improvement is possible by considering more
words, stemming for plurals and tenses of words, better idf
values (like those used for archival retrieval), a modern
lexicon, and a focus on paragraphs instead of whole
documents.
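
The size of the increase can be checked with a short calculation. The sketch assumes that the 23% refers to the non-interpolated average precision values printed in Figures 16 and 17 (0.0746 and 0.0919); that reading is an assumption, as is the function name.

```python
def relative_improvement(baseline, improved):
    """Percentage change of `improved` over `baseline`."""
    return (improved - baseline) / baseline * 100.0


keyword_avg = 0.0746   # Figure 16, average precision (non-interpolated)
semantic_avg = 0.0919  # Figure 17, average precision (non-interpolated)

gain = relative_improvement(keyword_avg, semantic_avg)
is_significant = gain > 10.0  # the "significant" threshold quoted above
```

Under that assumption the gain is a little over 23%, which is above the 10.0% threshold quoted in the text.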

5.  Summary 

Our  progress  during  TREC-1  and  TREC-2  has  been  the 
following: 

a.  We  created  efficient  code  for  a  UNIX  platform.  Originally 
our  code  used  B+  tree  structures  for  implementing  inverted 
files  on  a  DOS  platform.  We  now  use  hashing  to  replace 
B+trees,  establishing  codes  to  replace  character  strings; 
and  the  UNIX  platform  provides  faster  processing  than  the 
DOS  platform. 

b. We built an index for a semantic lexicon based on the public
domain 1911 version of Roget's Thesaurus. To do this,
we had to create our own category numbering system
similar to today's version of Roget's Thesaurus.

c. We solved part of the blend problem for semantic and
keyword weights. We now base semantic category weights
on the idf of words which generate the semantic categories.

We can now index or scan TREC documents at rates faster
than 60 Megabytes per hour depending on the workstation.
We have a semantic lexicon of approximately 20,000 words
with flexible category codes that allow a coarse (36
categories) through fine (more than 15,000 categories) semantic
analysis. As shown in Section 4, our procedure for
determining relevance is based on the senses of each word.
For example, using the vector processing model and the
similarity coefficient

y.i 




Third List

Item
Number  First Entry              Second Entry

 1      .6 * 1 * 1/7  = .0857    .6 * 1 * 1/7  = .0857
 2      0 * 1 * 1/12  = 0        0 * 1 * 1/12  = 0
 3      .6 * 1 * 1/7  = .0857    .6 * 1 * 1/7  = .0857
 4      .3 * 1 * 1/16 = .0188    .3 * 1 * 1/16 = .0188
 5      0 * 1 * 1/12  = 0        0 * 1 * 1/12  = 0
 6      .3 * 1 * 3/16 = .0563    .3 * 1 * 3/16 = .0563
 7      .3 * 1 * 7/24 = .0875    .3 * 1 * 7/24 = .0875
 8      .3 * 1 * 1/8  = .0375    0 * 1 * 7/24  = 0
 9      .3 * 1 * 1/16 = .0188    .6 * 1 * 1/7  = .0857
10      .3 * 1 * 1/16 = .0188    .3 * 1 * 1/16 = .0188
11      .3 * 1 * 1/8  = .0375    .3 * 1 * 1/8  = .0375
12      .3 * 1 * 1/16 = .0188    .3 * 1 * 1/16 = .0188
13      .3 * 1 * 3/16 = .0563    .3 * 1 * 3/16 = .0563
14      0 * 1 = 0                0 * 1 = 0
15      0 * 1 * 7/24  = 0        0 * 1 * 7/24  = 0
16      0 * 1 * 1/12  = 0        .6 * 1 * 1/7  = .0857
17      0 * 1 * 1/12  = 0        0 * 1 * 1/12  = 0
18      0 * 1 * 1/12  = 0        0 * 1 * 1/12  = 0
19      0 * 1 * 1/24  = 0        0 * 1 * 1/24  = 0
20      0 * 1 * 1/12  = 0        0 * 1 * 1/12  = 0
21      .6 * 1 * 1/7  = .0857    .6 * 1 * 1/7  = .0857
22      .6 * 1 * 1.0  = .6000    .6 * 1 * 1.0  = .6000
23      .6 * 1 * 2/3  = .4000    .6 * 1 * 2/3  = .4000
24      .6 * 1 * 1.0  = .6000    .6 * 1 * 1.0  = .6000

Figure 14. Weights of Semantic Components.


Fourth List

Item Number  Value

 1           .00734
 2           0
 3           .00734
 4           .00035
 5           0
 6           .00317
 7           .00734
 8           0
 9           .00170
10           .00035
11           .00141
12           .00035
13           .00317
14           0
15           0
16           0
17           0
18           0
19           0
20           0
21           .00734
22           .36000
23           .16000
24           .36000

Sum of all values in Fourth List: 0.91986

Figure 15. Multiplied Weights and Their Sum.
Queryid (Num): 47 of 50

Total number of documents over all queries
Retrieved: 36610
Relevant:  2064
Rel_ret:   913

Interpolated Recall - Precision Averages
at 0.00   0.3514
at 0.10   0.1968
at 0.20   0.1367
at 0.30   0.1082
at 0.40   0.0894
at 0.50   0.0752
at 0.60   0.0276
at 0.70   0.0105
at 0.80   0.0062
at 0.90   0.0013
at 1.00   0.0007

Average precision (non-interpolated) over all rel docs
0.0746

Precision:
At    5 docs:  0.1660
At   10 docs:  0.1532
At   15 docs:  0.1433
At   20 docs:  0.1298
At   30 docs:  0.1057
At  100 docs:  0.0643
At  200 docs:  0.0465
At  500 docs:  0.0302
At 1000 docs:  0.0194

R-Precision (precision after R (= num_rel for a query)
docs retrieved):
Exact: 0.1035

Figure 16. Filtering Using Keywords.


Queryid (Num): 47 of 50

Total number of documents over all queries
Retrieved: 36383
Relevant:  2064
Rel_ret:   956

Interpolated Recall - Precision Averages
at 0.00   0.3961
at 0.10   0.2479
at 0.20   0.1734
at 0.30   0.1258
at 0.40   0.1067
at 0.50   0.0838
at 0.60   0.0372
at 0.70   0.0195
at 0.80   0.0100
at 0.90   0.0029
at 1.00   0.0009

Average precision (non-interpolated) over all rel docs
0.0919

Precision:
At    5 docs:  0.2426
At   10 docs:  0.2149
At   15 docs:  0.1801
At   20 docs:  0.1574
At   30 docs:  0.1383
At  100 docs:  0.0745
At  200 docs:  0.0522
At  500 docs:  0.0320
At 1000 docs:  0.0203

R-Precision (precision after R (= num_rel for a query)
docs retrieved):
Exact: 0.1283

Figure 17. Filtering Using Semantic Categories.



if the word "trains" is in the Query and the word "leaves" is
in the Document and we look at the semantic category Motion
with Reference to Direction (AMDR), then one of the vector
product elements in the formula becomes:

\[
\left(\text{weight of `trains' in Query}\right)
\left(\text{probability `trains' triggers AMDR}\right) \cdot
\left(\text{weight of `leaves' in Document}\right)
\left(\text{probability `leaves' triggers AMDR}\right)
\]

where the probabilities are obtained from our semantic
lexicon.

We plan to do more experiments incorporating the
following improvements:

a. Modernize the semantic lexicon. Since our lexicon is based
on the 1911 version of Roget's Thesaurus, many modern
words are not present and the senses of recorded words are
not accurate. We plan to correct this. For example, we
could try to get permission to use the current version of
Roget's Thesaurus.

b. Base similarity on paragraphs instead of whole documents.
We have had success using as few as 36 categories in a
paragraph environment. We also feel that relevance
decisions are made by humans looking at roughly a
paragraph of information. We plan to modify our code to
use paragraphs as a basis for the similarity measure.

c. Experiment with the number of possible semantic
categories and the probability assigned to a triggered category.
The experiment behind the performance improvement
shown in Figure 16 and Figure 17 uses a very fine-grained set
of semantic categories and treats the triggered semantic
categories for a word uniformly. We plan to experiment
with a smaller number of categories, and we plan to obtain
a probability distribution for categories based on word
usage.

Basically, we are trying to establish a statistically sound
approach to using word sense information. Our intuition is that
word sense information should improve retrieval
performance. Furthermore, our approach to using word sense
information has shown a significant performance improvement in
a question/answer environment where paragraphs represent
documents. We feel that other word sense approaches, such
as query expansion or word sense disambiguation, may not
be statistically sound, and that may be why successful
experiments have not been reported.
Although most groups participating in TREC-2
emphasized precision and recall, the conference was also
an appropriate forum in which to discuss the efficiency
of document indexing and retrieval. For some
participants, running out of disk space was their worst
problem, while for others running out of time was. Some
groups were unable to run on the entire collection, and
several remarked that they were happy to have gotten
anything running at all. No group reported finding
TREC trivial.

Two discussion groups were organized to address
efficiency issues, one focusing on document indexing, the
other on document retrieval. Since efficiency has not
been emphasized by the research IR community,
participants in both groups felt that current algorithms
have a lot of room for improvement. TREC provides
one motivation for such improvements, as do
ever-growing real world databases. However, there was
concern in both groups that the TREC format, which
encourages participation in both ad hoc (retrospective)
and routing (filtering) tasks, might discourage research
on efficient task-specific architectures.

The  following  sections  provide  more  detail  on  each 
of  the  two  discussion  groups. 

1    Document  Indexing 

The  raw  text  for  the  TREC  collection  (routing  and 
ad  hoc)  required  approximately  3  GB  of  space.  Index 
structures  required  from  7.550themselves  sufficient  to 
recreate  the  original  text,  so  they  would  be  additional 
overhead  in  an  operational  system.  (A  research  system 
might  be  able  to  discard  the  original  text,  reporting 
just  document  ids  for  evaluation.) 

Room 2C409
Murray Hill, NJ 07974, USA
(lewis@research.att.com)

Several groups, including CITRI, Thinking
Machines, and UMass, stored inverted lists in compressed
form. There was general agreement that for sites
willing to invest the programming effort, substantial space
savings could be achieved in this fashion. (CITRI
demonstrated a factor of six reduction in index file
size.) There was more debate on the potential for
index compression speeding up query processing as well,
with some participants saying their query processing
was I/O bound, but others saying theirs was CPU
bound. The peak amount of space used during
indexing (for text, indices, and auxiliary files) varied from
llOthat  this  may  be  worth  more  attention. 
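
The report does not say which compression schemes the groups used. As a generic illustration only, the sketch below stores the gaps between sorted document ids with a variable-byte code; all function names are hypothetical, and nothing here should be read as the CITRI, Thinking Machines, or UMass method.

```python
def vbyte_encode(numbers):
    """Encode non-negative ints, 7 bits per byte, low bits first.
    The high bit marks the last byte of each number."""
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append(n & 0x7F)
            n >>= 7
        out.append(n | 0x80)
    return bytes(out)


def vbyte_decode(data):
    numbers, n, shift = [], 0, 0
    for b in data:
        if b & 0x80:                      # last byte of this number
            numbers.append(n | ((b & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= b << shift
            shift += 7
    return numbers


def compress_postings(doc_ids):
    """doc_ids must be sorted ascending; gaps are small, so they pack well."""
    if not doc_ids:
        return b""
    gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    return vbyte_encode(gaps)


def decompress_postings(data):
    doc_ids, total = [], 0
    for gap in vbyte_decode(data):
        total += gap
        doc_ids.append(total)
    return doc_ids


postings = [3, 7, 150, 100000]
packed = compress_postings(postings)
restored = decompress_postings(packed)
```

In this toy case four document ids occupy seven bytes. Decoding costs CPU time, which is consistent with the debate above about whether query processing is I/O bound or CPU bound.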

Efficiency improvements will not come immediately,
and some may require significant expense in
programming time. Sharing of software between groups, which
increased from TREC-1 to TREC-2, helps limit this
expense. In addition, TREC research groups
primarily interested in, say, query analysis, may in the future
want to team up with groups that have addressed or
are addressing issues of scale. As the size of test
collections increases, it makes less sense to have them
replicated at dozens of sites, particularly when
interactive access across networks is usually easily available.

Time  to  build  index  structures  was  tolerable  for 
most  participants,  though  it  was  mentioned  that  it 
never  seems  to  go  as  easily  or  automatically  as  one 
might  hope.  It  was  a  serious  issue  for  groups  doing 
some  form  of  natural  language  processing.  Times  of 
2  to  4  MB/hr  were  mentioned  by  at  least  two  of  the 
NL  groups,  making  TREC  a  very  daunting  task.  The 
opinion was expressed by some participants that an
NL technique would have to provide as yet
undemonstrated improvements in effectiveness to be worth the
slowdown in indexing and query processing.

Complex text representations, such as those
produced by NL, require additional information in the
index structures. The different ways of dealing with
this problem are most noticeable in the handling of
phrases in TREC, with some groups indexing on
phrases just as on words, others relying on word
position information stored in an inverted file, and still
others reparsing the raw text of a subset of documents
at retrieval time to find phrase occurrences.


2    Document  Retrieval 

The  second  discussion  group  was  organized  around 
general  issues  in  document  retrieval.  Participants  were 
encouraged  to  use  their  experience  with  the  one  and 
two  gigabyte  TREC  collections  to  forecast  the  issues 
that  will  arise  when  collections  are  larger  and  more 
distributed.  Two  issues  dominated  the  discussion. 
2.1    Will Existing Methods Scale?

Recent trends in information retrieval are towards
gigabyte and terabyte document collections, retrieval by
subsections/paragraphs, and multiple representations
of document content. The first focus questions were
whether existing storage and retrieval methods can
cope with this explosion of information, and if new
methods are needed, what might they be?

Participants  identified  two  approaches  as  currently 
dominating  IR: 

1. Search all available subcollections (e.g. the
TIPSTER/TREC task), or

2. have a user specify which subcollection to search
(e.g. commercial systems).

Both approaches require modification if they are to
scale up.

One problem with the "search everything" approach
is that the growth rate of online information was felt
to exceed the growth rate in computer performance.
Even if a collection is distributed across multiple
processors, detailed consideration of every document may
be too expensive when an IR system faces a terabyte of
data. One solution is to do a fast first-pass retrieval to
produce a reduced set of documents for more detailed
consideration. This first pass might involve
generating approximate scores for each document (e.g. ETH
in TREC-2), scoring documents based on a smaller
amount of text (e.g. abstracts or introductions), or
scoring cluster centroids rather than individual
documents.
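
A two-pass search of this kind can be sketched as follows. The scoring functions are deliberately crude term counts, and the field names (`abstract`, `text`) and parameters (`keep`, `top_k`) are hypothetical; the sketch is not a description of any participant's system.

```python
import heapq


def two_pass_search(query_terms, docs, keep, top_k):
    """First pass: cheap score on the abstract only, keep the best `keep`.
    Second pass: full-text score on the survivors, return the best `top_k`."""
    def cheap_score(doc):
        words = doc["abstract"].split()
        return sum(words.count(t) for t in query_terms)

    def full_score(doc):
        words = doc["text"].split()
        return sum(words.count(t) for t in query_terms)

    candidates = heapq.nlargest(keep, docs, key=cheap_score)
    return heapq.nlargest(top_k, candidates, key=full_score)


docs = [
    {"id": "d1", "abstract": "train schedules",
     "text": "train schedules train station train"},
    {"id": "d2", "abstract": "train travel", "text": "train travel by rail"},
    {"id": "d3", "abstract": "cooking",
     "text": "train train train train cooking"},
]
best = two_pass_search(["train"], docs, keep=2, top_k=1)
best_ids = [d["id"] for d in best]
```

Here the result is d1. Document d3 would score highest on full text but is dropped in the first pass, which illustrates the accuracy that a fast first pass can give up in exchange for speed.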

One problem with the "user chooses subcollections"
approach is that the task will be overwhelming when
there are many subcollections. A significant portion
of users of one commercial service already choose to
search everything rather than select subcollections.
The system will have to provide assistance if user
selection is to be viable. If subcollections can be
characterized succinctly, perhaps by centroid vectors,
automatically generated thesauri, or controlled vocabulary
terms (assigned manually or automatically), then one
could use the query to rank subcollections, and then
search only the top-ranked subcollections. Other
approaches include assistance by an expert system, or
browsing interfaces for hyperlinked subcollections.

Global statistics that summarize some aspect of a
collection (e.g. idf) were expected to be a problem for
searching multiple subcollections and distributed
document collections. If a collection is formed at indexing
time, statistics can be gathered and saved when
indices are built. If a single processor performs retrieval,
statistics can be gathered during retrieval. However,
if multiple subcollections or processors are involved, it
is less clear how to compute global statistics.
Methods that rely only on local statistics that summarize a
document (e.g. Berkeley in TREC-2) offer a
computational advantage in this environment.
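
One possible way to obtain a global idf, sketched below, is for each subcollection to report its document count and per-term document frequencies, which are then merged. The function names and the formula log(N/df) are assumptions made for the sketch; the discussion group did not settle on a method.

```python
import math
from collections import Counter


def local_stats(docs):
    """Per-subcollection statistics: (number of documents, term -> df)."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.split()))  # count each term once per document
    return len(docs), df


def global_idf(stats):
    """Merge local counts and compute idf = log(N / df) over the union."""
    total_docs = sum(n for n, _ in stats)
    total_df = Counter()
    for _, df in stats:
        total_df.update(df)
    return {term: math.log(total_docs / count)
            for term, count in total_df.items()}


sub_a = ["train station", "train"]
sub_b = ["station clock", "clock", "noon"]
idf = global_idf([local_stats(sub_a), local_stats(sub_b)])
```

Only the small count tables cross subcollection boundaries, not the documents themselves, but every site still has to be consulted before scoring, which is the cost that purely local statistics avoid.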


2.2    Specialized  Hardware  and  Software 

A related question posed by increasingly large and
distributed text databases is whether existing hardware
and software platforms will be up to the challenge. The
consensus among participants was that conventional
architectures will suffice, because IR is a data-parallel
task that lends itself to distributed computation.
Participants also felt that they had generally ignored
issues of efficiency, and could increase their speeds if
necessary. There was little support for
supercomputers, massively parallel computers, or specialized
architectures.

One could argue that the participants were biased
towards conventional hardware and software by their
own need for flexibility, their small budgets, their
insulation from the time constraints of real users, and their
desire not to think about 'systems' issues not directly
relevant to their research. However, the recent fielding
of ranked retrieval systems using conventional
mainframes by some of the largest online vendors provides
additional support for their view that conventional
architectures will suffice.
Workshop Report
Use of training materials in constructing routing queries


William S. Cooper
Stephen E. Robertson

The participants outlined various methods of exploiting the training data for routing retrieval that had been used in
the conference. In all cases the data had been used in a topic-specific manner; i.e. each query was constructed or
expanded using relevance judgements for that particular topic only.

In some systems, terms taken directly from the topic were weighted or reweighted using the training data. In others,
terms taken from the training documents relevant to the topic were used in addition to topic terms, or were used instead
with the original topic terms playing no part. In a few cases, terms from both relevant and non-relevant documents were
added, the latter with negative weights. Relevant documents with a high preliminary retrieval ranking coefficient were
preferred as a source of expansion terms in one system. Probabilistic, feedback and ad-hoc methods had all been tried as
ways of modifying the query in response to the training data.

How  far  might  a  query  profitably  be  expanded  on  the  basis  of  the  training  data?  Though  this  question  was  not 
answered  definitively,  some  participants  indicated  a  greater  willingness  to  consider  drastic  expansion  than  had  been 
thought  advisable  before  TREC  2. 

The sample of relevance judgements for TREC 2 was thought to be adequate in size and not unrealistically large. It
was sufficiently representative in its inclusiveness of feedback generated from a wide variety of systems. However, this
variety indicates a possible lack of realism, in that a real system would probably have access only to relevant documents
retrieved by itself. Thus the use of only those relevant documents found in a search on the training data by the system in
question might be regarded as more realistic.
APPENDIX A


This appendix contains tables of results for all the TREC-2 participants. The tables in Appendix A show
various measures of the performance on the adhoc and routing tasks. The adhoc results come first, followed by
the routing results, with the tables in the same order as the presentation order of the papers. The definitions of
the evaluation measures are given in the Overview, section 4, and readers unfamiliar with these measures should
read that section first.

Care should be taken in comparing the tables across systems. These measures show performance only, with
no measure of user or system effort.

Each table contains four major boxes of statistics and three graphs.

Box 1 -- Summary Statistics

line 1 -- unique run identifier, data subset, and query construction method used

    Data subset
        full (disks 1 and 2 for adhoc, disk 3 for routing)
        category B (the official subset of data, 1/4 of the data using the Wall Street Journal articles for adhoc
        and the San Jose Mercury News articles for routing)

    Query construction method
        automatic
        manual
        feedback (frozen evaluation used)

line 2 -- Number of topics included in averages.

line 3 -- Total number of documents retrieved over all topics. Here, "retrieved" means having a rank less than 1001.

line 4 -- Total number of relevant documents for all topics in the collection (whether retrieved or not).

line 5 -- Total number of relevant retrieved documents for this run.

Box 2 -- Recall Level Averages

lines 1-11 -- The average over all topics of the precision at each of the 11 recall points given. Note
that this is interpolated precision: e.g., for a particular topic, if the precision at 0.50
recall is greater than the precision at 0.40 recall, then the precision at 0.50 recall
is used for both the 0.50 and 0.40 recall levels.

line 12 -- The average precision as calculated in a non-interpolated manner
(see section 4 of the Overview for details on this calculation).
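
The interpolation rule for a single topic can be sketched as below. The function name and the input format (a ranked list of relevance flags plus the number of relevant documents) are assumptions made for the sketch; the authoritative definitions are those in section 4 of the Overview.

```python
def interpolated_precision(ranked_relevance, num_relevant):
    """Interpolated precision at recall 0.0, 0.1, ..., 1.0 for one topic.

    The value at each recall level is the highest precision observed at
    that level or at any higher recall level.
    """
    points = []  # (relevant documents found so far, precision at that rank)
    hits = 0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            points.append((hits, hits / rank))

    curve = []
    for tenth in range(11):
        # recall >= tenth/10  is the same as  hits * 10 >= tenth * num_relevant
        eligible = [p for h, p in points if h * 10 >= tenth * num_relevant]
        curve.append(max(eligible, default=0.0))
    return curve


# One topic with two relevant documents, found at ranks 1 and 3.
curve = interpolated_precision([True, False, True, False, False], 2)
```

For this topic the interpolated precision is 1.0 at recall levels 0.0 through 0.5 and 2/3 at levels 0.6 through 1.0. The table entries are averages of such per-topic values over all topics.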

Box 3 -- Document Level Averages

lines 1-9 -- The average recall and precision after the given number of documents have been retrieved.

line 10 -- The R-precision. This is a new evaluation measure being tried that
averages the precisions found for each topic at the document level of R,
where R is the number of relevant documents for that topic.
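
As a small sketch of the R-precision described for line 10, assuming each topic is given as a ranked list of relevance flags together with its number of relevant documents R (the function names and input format are hypothetical):

```python
def r_precision(ranked_relevance, num_relevant):
    """Precision after R documents, where R = num_relevant for the topic."""
    if num_relevant == 0:
        return 0.0
    return sum(ranked_relevance[:num_relevant]) / num_relevant


def mean_r_precision(topics):
    """Average of the per-topic R-precision values."""
    return sum(r_precision(flags, r) for flags, r in topics) / len(topics)


topics = [
    ([True, False, True, False, False], 2),  # 1 relevant in the top 2
    ([False, True, True], 3),                # 2 relevant in the top 3
]
average = mean_r_precision(topics)
```

The two toy topics give 1/2 and 2/3, so the average is 7/12.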
Box 4 -- Fallout Level Averages

lines 1-11 -- The average recall (probability of detection) at a fixed fallout (probability of false alarm rate).
Ten equally spaced fallout points in the high precision end of the curve were used, and for
each topic, the highest recall value in the region surrounding that fallout point is
selected. The table shows the averages of these points across all topics.

Graph 1 -- Recall-Precision Curve

This is a plot of the data shown in Box 2.

Graph 2 -- Fallout-Recall Curve

This is a plot of the data shown in Box 4.

Graph 3 -- Normal Deviate - Fallout-Recall

This is a plot of the data shown in Box 4, but plotted on probability scales.
APPENDIX B


This appendix contains charts created from the supplemental forms filled out by each group about their
system. These charts are meant to supplement the papers and contain a standardized and formatted description of
system features and timing aspects.
are survey articles on topics closely related to the Institute's technical and scientific programs.
Issued  six  times  a  year. 




Nonperiodicals 


Monographs  —  Major  contributions  to  the  technical  literature  on  various  subjects  related  to  the 
Institute's  scientific  and  technical  activities. 

Handbooks  — Recommended  codes  of  engineering  and  industrial  practice  (including  safety  codes) 
developed  in  cooperation  with  interested  industries,  professional  organizations,  and  regulatory 
bodies. 

Special  Publications  — Include  proceedings  of  conferences  sponsored  by  NIST,  NIST  annual 
reports,  and  other  special  publications  appropriate  to  this  grouping  such  as  wall  charts,  pocket 
cards,  and  bibliographies. 

Applied  Mathematics  Series  — Mathematical  tables,  manuals,  and  studies  of  special  interest  to 
physicists,  engineers,  chemists,  biologists,  mathematicians,  computer  programmers,  and  others 
engaged  in  scientific  and  technical  work. 

National  Standard  Reference  Data  Series  — Provides  quantitative  data  on  the  physical  and  chemical 
properties  of  materials,  compiled  from  the  world's  literature  and  critically  evaluated.  Developed 
under  a  worldwide  program  coordinated  by  NIST  under  the  authority  of  the  National  Standard 
Data  Act  (Public  Law  90-396).  NOTE:  The  Journal  of  Physical  and  Chemical  Reference  Data 
(JPCRD)  is  published  bimonthly  for  NIST  by  the  American  Chemical  Society  (ACS)  and  the 
American  Institute  of  Physics  (AIP).  Subscriptions,  reprints,  and  supplements  are  available  from 
ACS,  1155  Sixteenth  St.,  NW,  Washington,  DC  20056.

Building  Science  Series  — Disseminates  technical  information  developed  at  the  Institute  on  building 
materials,  components,  systems,  and  whole  structures.  The  series  presents  research  results,  test 
methods,  and  performance  criteria  related  to  the  structural  and  environmental  functions  and  the 
durability  and  safety  characteristics  of  building  elements  and  systems. 

Technical  Notes  — Studies  or  reports  which  are  complete  in  themselves  but  restrictive  in  their 
treatment  of  a  subject.  Analogous  to  monographs  but  not  so  comprehensive  in  scope  or  definitive 
in  treatment  of  the  subject  area.  Often  serve  as  a  vehicle  for  final  reports  of  work  performed  at 
NIST  under  the  sponsorship  of  other  government  agencies. 

Voluntary  Product  Standards  — Developed  under  procedures  published  by  the  Department  of 
Commerce  in  Part  10,  Title  15,  of  the  Code  of  Federal  Regulations.  The  standards  establish 
nationally  recognized  requirements  for  products,  and  provide  all  concerned  interests  with  a  basis 
for  common  understanding  of  the  characteristics  of  the  products.  NIST  administers  this  program 
in  support  of  the  efforts  of  private-sector  standardizing  organizations. 

Consumer  Information  Series  — Practical  information,  based  on  NIST  research  and  experience, 
covering  areas  of  interest  to  the  consumer.  Easily  understandable  language  and  illustrations 
provide  useful  background  knowledge  for  shopping  in  today's  technological  marketplace.

Order  the  above  NIST  publications  from:  Superintendent  of  Documents,  Government  Printing  Office,
Washington,  DC  20402. 

Order  the  following  NIST  publications  — FIPS  and  NISTIRs—from  the  National  Technical  Information 
Service,  Springfield,  VA  22161. 

Federal  Information  Processing  Standards  Publications  (FIPS  PUB) —  Publications  in  this  series 
collectively  constitute  the  Federal  Information  Processing  Standards  Register.  The  Register  serves 
as  the  official  source  of  information  in  the  Federal  Government  regarding  standards  issued  by 
NIST  pursuant  to  the  Federal  Property  and  Administrative  Services  Act  of  1949  as  amended,
Public  Law  89-306  (79  Stat.  1127),  and  as  implemented  by  Executive  Order  11717  (38  FR  12315,
dated  May  11,  1973)  and  Part  6  of  Title  15  CFR  (Code  of  Federal  Regulations).

NIST  Interagency  Reports  (NISTIR)  — A  special  series  of  interim  or  final  reports  on  work 
performed  by  NIST  for  outside  sponsors  (both  government  and  non-government).  In  general, 
initial  distribution  is  handled  by  the  sponsor;  public  distribution  is  by  the  National  Technical 
Information  Service,  Springfield,  VA  22161,  in  paper  copy  or  microfiche  form. 